Method and Apparatus for Acceleration of Cryptographic Co-processing Peripherals

FIELD OF THE INVENTION
The present invention relates to apparatus operative to accelerate cryptographic
co-processing peripherals and additionally but not exclusively to the use of such an
accelerated processing apparatus for polynomial based and prime number field
arithmetic, extending the range of computational fields of integers and width of serial
input operands in modular arithmetic public key cryptographic co-processors designed
for elliptic curve and RSA type computations.

BACKGROUND OF THE INVENTION 
Security enhancements and performance accelerations for computational devices
are described in Applicant's U.S. Patents 5,742,530, hereinafter "P1", 5,513,133,
5,448,639, 5,261,001, and 5,206,824 and published PCT patent application
PCT/IL98/00148 (WO98/50851); and U.S. Patent application 09/050,958; Onyszchuk et
al's U.S. Patent 4,745,568; Omura et al's U.S. Patent 4,587,627; and applicant's U.S.
Patent Application 09/480,102; the disclosures of which are hereby incorporated by
reference. Applicant's U.S. Patent 5,206,824 shows an early apparatus operative to
implement polynomial based multiplication and squaring, which cannot perform
operations in the prime number field, and is not designed for interleaving in polynomial
based computations. An additional analysis is made of an approach to use the extension
field in polynomial based arithmetic in Paar, C., F. Fleischmann and P.
Soria-Rodriguez, "Fast Arithmetic for Public-Key Algorithms in Galois Fields with
Composite Exponents", IEEE Transactions on Computers, vol. 48, No. 10, October
1999, henceforth "Paar".

SUMMARY OF THE INVENTION 
It is an aim of the present invention to provide a microelectronic specialized 
arithmetic unit operative to perform large number computations in the polynomial 
based and prime integer based number fields, using similar anticipating methods for 
simultaneously performing interleaved modular multiplication and reduction on varied 
radix multipliers. 

A further aim of the invention is to provide a compact microelectronic specialized
arithmetic logic unit, for performing modular and normal (natural, non-negative field of
integers) multiplication, division, addition, subtraction and exponentiation over very
large integers. When referring to modular multiplication and squaring using both
Montgomery methods and a reversed format method for simplified polynomial based
multiplication and squaring, reference is made to the specific parts of the device as a
superscalar modular arithmetic coprocessor, or SMAP, or MAP, also as relates to
enhancements existing in the applicant's U.S. Patent pending 09/050,958 filed March
31, 1998.

According to a first aspect of the present invention there is thus provided
microelectronic apparatus for performing ⊗ multiplication and squaring in both
polynomial based GF(2^q) and GF(p) field arithmetic, squaring and reduction using a
serial fed radix 2^l multiplier, B, with k character multiplicand segments, Ai, and a k
character ⊕ accumulator wherein reduction to a limited congruence is performable "on
the fly", in a systolic manner, with Ai, a multiplicand, times B, a multiplier, over a
modulus, N, and a result being at most 2k + 1 characters long, including k first emitting
disregarded zero characters, which are not saved, where k characters have no less bits
than the modulus, wherein said operations are carried out in two phases, the apparatus
comprising:

a first (B), and second (N) main memory register, each register operative to hold at
least n bit long operands, respectively operative to store a multiplier value designated B, and
a modulus, denoted N, wherein the modulus is smaller than 2^n;

a digital logic sensing detector, Y0, operative to anticipate "on the fly" when a modulus
value is to be ⊕ added to the value in the ⊕ adder accumulator device such that all first k
characters emitted from the device are forced to zero;

a modular multiplying device for at least k character input multiplicands, with only
one, at least k characters long ⊕ adder, a ⊕ summation device operative to accept k
character multiplicands, the ⊗ multiplication device operative to switch into the
⊕ accumulator device multiplicand values in turn, and in turn to receive multiplier
values from a B register, and an "on the fly" simultaneously generated anticipated value
as a multiplier which is operative to force k first emitting zero output characters in the
first phase, wherein at each effective machine cycle at least one designated multiplicand
is ⊕ added into the ⊕ accumulation device;

the multiplicand values to be switched in turn into the ⊕ accumulation device
consisting of one or two of the following three multiplicands, the first multiplicand
being an all-zero string value, a second value, being the multiplicand Ai, and a third
value, the N0 segment of the modulus;

an anticipator to anticipate l bit k character serial input Y0 multiplier values;

the apparatus being operable to input in turn multiplier values into the multiplying
device in the first phase, said values being first the B operand, and concurrently, the
second multiplier value consisting of the Y0, "on the fly" anticipated k character string,
to force first emitted zeroes in the output;

the apparatus further comprising an ⊕ accumulation device, operative to output
values simultaneously as multiplicands are ⊕ added into the ⊕ accumulation device;
and

an output transfer mechanism, operative in the second phase to output a final
modular ⊗ multiplication result from the ⊕ accumulation device.

According to a preferred embodiment ⊕ summations into the ⊕ accumulation
device are activated by each one of a series of successively newly serially loaded higher
order multiplier character values.

Preferably, the multiplier characters are operative to cause no ⊕ summation into
the ⊕ accumulation device if both the input B character and the corresponding input Y0
character are zeroes;

are operative to ⊕ add in only the Ai multiplicand if the input B character is a one
and the corresponding Y0 character is a zero;

are operative to ⊕ add in only the N modulus, if the B character is a zero, and the
corresponding Y0 character is a one; and

are operative to ⊕ add in the ⊕ summation of the modulus, N, with the
multiplicand Ai, if both the B input character and the corresponding Y0 character are
ones.
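
The four cases above amount to a two-bit selection of what is ⊕ summed into the
accumulator on each serial step. The following is a minimal sketch only, assuming single
bit characters (l = 1) and operands held as Python integers; the name select_addend and
the gf2 flag are illustrative and do not come from the specification.

```python
def select_addend(b_char, y0_char, a_i, n, gf2=False):
    """Return the value summed into the accumulator for one serial step.

    b_char and y0_char are the current 1 bit B and Y0 multiplier characters;
    a_i and n are the multiplicand segment and the modulus (hypothetical ints).
    """
    if not b_char and not y0_char:
        return 0                  # both zero: nothing is summed
    if b_char and not y0_char:
        return a_i                # only the multiplicand
    if not b_char and y0_char:
        return n                  # only the modulus
    # both ones: the pre-summed value, XOR in GF(2^q), integer add in GF(p)
    return (a_i ^ n) if gf2 else (a_i + n)
```

Holding the sum of Ai and N ready in advance is what lets a single adder serve both
multiplier streams in the same cycle.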

Preferably, the apparatus is operative to preload multiplicand values Ai and N, into
two designated preload buffers, and to ⊕ summate these values into a third multiplicand
preload buffer, obviating the necessity of ⊕ adding in each multiplicand value
separately.

Preferably the multiplier character values are arranged for input in serial single
character form, the ⊕ accumulation device is arranged for output in serial single
character form, and wherein the Y0 detect device is operative to anticipate only one
character in a clocked turn.

Preferably, the ⊕ accumulation device is operable to perform modulo 2, XOR
addition/subtraction, and wherein all carry bits in addition and subtraction components
are disregardable, thus not needing provisions for overflow and further limiting modular
reduction in computations.

In a preferred embodiment, carry inputs are disabled to zero, and being operative
to perform polynomial based multiplication.

Preferably, the apparatus is operative to provide non-carry arithmetic by setting S
equal to zero acting on an element in a circuit equation computing in GF(2^q).

Preferably, the apparatus is operative to provide non-carry arithmetic by omitting
carry circuitry such that the S designates omitted circuitry and reducing adders and
subtractors, designated ⊕, to XOR, modulo 2 addition/subtraction elements.
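
The effect of disabling the carry path can be illustrated with a ripple adder whose
carries are gated by a switch. This is a sketch under stated assumptions: the function
name ripple_add and the carry_enable parameter are hypothetical, standing in for the
carry-disabling control described above.

```python
def ripple_add(x, y, width, carry_enable=True):
    """Add two width-bit values cell by cell; return (sum, carry_out).

    With carry_enable False every carry is forced to zero, so each cell
    reduces to an XOR and the adder performs GF(2^q) addition.
    """
    carry = 0
    out = 0
    for i in range(width):
        xb = (x >> i) & 1
        yb = (y >> i) & 1
        out |= (xb ^ yb ^ carry) << i
        # full adder carry, gated by the switch
        carry = ((xb & yb) | (carry & (xb ^ yb))) if carry_enable else 0
    return out, carry
```

With the switch off, ripple_add(x, y, w, False) returns x XOR y with a zero carry out, and
addition and subtraction become the same operation.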

A preferred embodiment is adapted such that the first k character segments emitted
from the operational units are zeroes, zero forcing being controlled by the following four
quantities in anticipating the next in turn Y0 character:

i the l bit Sout bits of the result of the l bit by l bit mod 2^l ⊗ multiplication of the
right-hand character of the Ai register times the Bd character of the B Stream,
A0·Bd mod 2^l;

ii the first emitting carry out character from the ⊕ accumulation device, S(CO0);

iii the l bit Sout character from the second from the right character emitting cell of
the ⊕ accumulation device, SO1;

iv the l bit J0 value, which is the negative multiplicative inverse of the right-hand
character in the N0 modulus multiplicand register,

wherein values, A0·Bd mod 2^l, S(CO0), and SO1 are ⊕ added character to
character together and "on the fly" multiplied by the J0 character to output a valid Y0
zero-forcing anticipatory character to force an l bit egressing string of zeroes.

The apparatus is preferably operable to perform ⊗ multiplication on polynomial
based operands in a reverse mode, said multiplication comprising multiplying from right
hand MS characters to left hand LS characters, said apparatus further being operative to
perform modular reduced ⊗ multiplication without utilizing Montgomery type parasitic
functions.

Preferably, the apparatus further comprises preload buffers, which buffers are
serially fed and which are connected such that multiplicand values are preloadable into
the preload buffers on the fly from one or more memory devices.

The apparatus is preferably operable to ⊕ sum a previously emitted value from an
additional n bit register, S, into the output value of the ⊕ accumulation device via a 1 bit
⊕ adder circuit such that first emitting output characters are zeroes,

wherein the Y0 detector is operative to detect any necessity of ⊕ adding moduli to
the ⊕ summation in the ⊕ accumulation device,

wherein the Y0 detector is further operative to detect utilization of the next in turn
⊕ added characters A0·Bd mod 2^l, S(CO0), SO1, and S(COZ), and the composite of
⊕ added characters to be finite field ⊗ multiplied on the fly by the l bit J0 value.

Preferably, for l = 1, J0 is implicitly 1, and the J0 ⊗ multiplication is carried out
implicitly.
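
The zero-forcing anticipation described above can be sketched in software for the GF(p)
case. This is an illustrative model only, not the circuit: the accumulator outputs
S(CO0) and SO1 are folded into the low character of one Python integer, and the names
neg_inverse, zero_forcing_char and mont_mul_serial are assumptions introduced here.

```python
def neg_inverse(n0, l):
    """Return J0, the negative multiplicative inverse of odd n0 mod 2^l."""
    modulus = 1 << l
    return (-pow(n0, -1, modulus)) % modulus

def zero_forcing_char(acc_low, a0, b_d, j0, l):
    """Return the Y0 character that forces the next l emitted bits to zero."""
    mask = (1 << l) - 1
    return ((acc_low + a0 * b_d) * j0) & mask

def mont_mul_serial(a, b, n, l, digits):
    """Feed b in l bit characters, LS first; return a*b*2^(-l*digits) mod n,
    in limited congruence (the result may exceed n by one modulus)."""
    mask = (1 << l) - 1
    j0 = neg_inverse(n & mask, l)
    s = 0
    for d in range(digits):
        b_d = (b >> (l * d)) & mask
        y0 = zero_forcing_char(s & mask, a & mask, b_d, j0, l)
        total = s + a * b_d + y0 * n
        assert total & mask == 0      # the emitted character is zero
        s = total >> l                # the zero character is discarded
    return s
```

For l = 1 the helper neg_inverse returns 1 for every odd N0, which matches the
statement that J0 is then implicit.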

Preferably, a comparator is operative to sense a finite field output from the ⊗
modular multiplication device whilst operating in GF(p), wherein the first right hand
emitting k zero characters are disregarded, wherein the output is larger than the
modulus, N, the apparatus thereby being operative to control a modular reduction
whence said value is output from a memory register to which an output stream from the
multiplier device is destined, and thereby not requiring a second memory storage device
for smaller ones of resulting product values.

Preferably, for ⊗ modular multiplication in the GF(2^q), the apparatus is operative
to carry out multiplication without an externally precomputed more than 1 bit
zero-forcing factor.

A preferred embodiment is operative to compute a J0 constant by resetting either
the A operand value or the B operand value to zero and setting a partial result
value, S0, to 1.

According to a second aspect of the present invention there is provided a
microelectronic apparatus for performing interleaved finite field ⊗ modular
multiplication of two integers A and B, so as to generate an output stream of A times B
modulus N, wherein a number of characters in a modulus operand register, n, is larger
than a segment length of k characters wherein the ⊗ modular multiplication is
performed in a plurality of interleaved iterations, wherein at each interleaved iteration,
operands are input into a ⊗ multiplying device, said operands comprising any one of N,
B, a previously computed partial result, S, and a k character string segment of A, a
multiplicand, the segments progressing from a first string segment A0 to a higher string
segment Am-1, wherein each iterative result is ⊕ summated into a next temporary result,
wherein at least first emitted characters of said iterations are zeroes, the apparatus
comprising:

first (B), second (S) and third (N) main memory registers, each register respectively
operative to store a multiplier value, a partial result value and a modulus;

a modular multiplying device operative to ⊕ summate into an ⊕ accumulation
device, in turn one or two of a plurality of multiplicand values, during each one of a
plurality of phases of the iterative ⊗ multiplication process, and in turn to receive as
multipliers, inputs from:

said B register,

an "on the fly" anticipating value (Y0) source, said anticipating value being usable
as a multiplier to force first emitting right-hand zero output characters in each iteration,
and

said N register;

the multiplicand parallel registers operative at least to receive in turn, values from
the A, B, and N registers, and also said zero forcing (Y0) value;

the apparatus further comprising a zero forcing (Y0) detect device operative to
generate a binary string operative to be a multiplier during a first multiplication phase
and operative to be a multiplicand in a second multiplication phase;

the apparatus being operable to obtain multiplicand values suitable for switching
into the ⊕ accumulation device for the first multiplication phase, said values comprising
firstly a zero value, secondly a value, Ai, being a k character string segment of a
multiplicand, A, and a third value N0, being the first emitting k characters of the
modulus, N;

the apparatus further being operable to utilize a temporary result value, S, resulting
from a previous iteration, to be ⊕ summated with a present result value emanating from
the ⊕ accumulation device, to generate a partial result for a next-in-turn iteration;

the apparatus further being operable to utilize multiplicand values to be input, in
turn, into the ⊕ accumulation device for a second multiplication phase comprising
firstly a zero value, secondly an Ai operand, remaining in place from the first phase, and
thirdly a Y0 value having been anticipated in the first multiplication phase;

multiplier values input into the multiplying device in the first phase being firstly
an emitted string, B0, said multiplication device being operable to multiply said string
concurrently ⊗ with a second ⊗ multiplier value consisting of the anticipated Y0 string
which is simultaneously loaded character by character as it is generated into a preload
multiplicand buffer for the second phase;

two multiplier values operable to be input into the apparatus during a second
phase being left hand n-k character values from the B operand, designated B̄, and the left
hand n-k characters of the N modulus, designated N̄, respectively; and

wherein said apparatus further comprises a multiplying flush out device operative
in a last multiplication phase to transfer a left hand segment of a result value remaining
in the ⊕ accumulation device into a result register.

Preferably, the apparatus is operable to perform ⊗ multiplication on polynomial
based operands in a reverse mode, multiplying from MS characters to LS characters, and
thereby being able to perform modular reduction without Montgomery type parasitic
functions.

According to a third aspect of the present invention there is provided apparatus
operative in modular multiplication to anticipate a Y0 value using first emitted values of
a multiplicand, A, and present inputs of a B multiplier, carry out values from a
⊕ accumulation device, ⊕ summation values from the ⊕ accumulation device, the
present values from a previously computed partial result, and carry out values from a
⊕ adder which ⊕ summates the result from the ⊕ accumulation device with the
previous partial result.

Preferably, the apparatus is adapted to ensure that k first emitted values from the
device are zeroes, said adaptation comprising anticipation of a next in turn Y0 character
using the following quantities:

i l bit Sout bits of a result of l bit by l bit mod 2^l multiplication of the
right-hand character of an Ai register times the Bd character of the B Stream,
A0·Bd mod 2^l;

ii a first emitting carry out character from the ⊕ accumulation device, S(CO0);

iii the l bit Sout character from a second from the right-hand character emitting
cell of the ⊕ accumulation device, SO1;

iv a next in turn character value from the S stream, Sd;

v a 1 bit carry out character from a Z output full adder, S(COZ);

vi an l bit J0 value, which is a negative multiplicative inverse of a right-hand
character in the N0 modulus multiplicand register;

wherein values, A0·Bd mod 2^l, S(CO0), SO1, Sd are ⊕ added character to
character together and "on the fly" ⊗ multiplied by the J0 character to output a valid Y0
zero-forcing anticipatory character.

In a further embodiment there is also provided at least one sensor operative
to compare an output result to N, the mechanism operative to actuate a second
subtractor on the output of the result register, thereby to output a modular reduced value
which is limited congruent to the output result value, thereby avoiding any necessity to
allot a second memory storage for a smaller result.

In a yet further embodiment, a value which is a ⊕ summation of two
multiplicands is loadable into a preload character buffer comprising at least a k character
memory register whilst one of the said two multiplicands is concurrently loaded into
another preload buffer.

According to a fourth aspect of the present invention there is provided apparatus
with one ⊕ accumulation device, and an anticipating zero forcing mechanism, operative
to perform a series of interleaved ⊗ modular multiplications and squarings, and being
adapted to perform concurrently the equivalent of three natural integer multiplication
operations, such that a result is an exponentiation.

In an embodiment, next in turn used multiplicands are preloaded into a
preload register buffer on the fly.



Preferably the apparatus is operable to ⊕ sum two multiplicands into at least a k
character register whilst concurrently loading one of the two multiplicands into a
preload buffer.

In a further embodiment, apparatus buffers and registers are operative to be
loaded with values from external memory sources and to be unloaded into an external
memory source during computations, such that a maximum size of the operands is
independent of sizes of said registers and said buffers.

In a yet further embodiment there is also provided a memory register, said
memory register being typically serial-single-character-in/serial-single-character-out,
parallel-at-least-k-characters-in/parallel-at-least-k-characters-out,
serial-single-character-in/parallel-at-least-k-characters-out, and
parallel-k-characters-in/serial-single-character-out.

Preferably, the apparatus is operable to provide, during a final phase of a ⊗
multiplication type iteration, at inputs of said multiplication device, a plurality of zero
characters, which zero characters are operative to flush out a left hand segment of a
memory of the carry save ⊕ accumulator.

Preferably, the apparatus is operable to preload next in turn multiplicands into
preload memory buffers on the fly, prior to their being required in an iteration.

Preferably, the apparatus is operable to preload multiplicand values into preload
buffers on the fly from a central storage memory.

According to a fifth aspect of the present invention there is provided a
microelectronic method for performing interleaved finite field ⊗ modular multiplication
of two integers A and B, so as to generate an output stream of A times B modulus N,
wherein a number of characters in a modulus operand register, n, is larger than a
segment length of k characters, wherein the ⊗ modular multiplication is performed in a
plurality of interleaved iterations, wherein at each interleaved iteration, operands are
input into a ⊗ multiplying device, said operands comprising any one of N, B, a
previously computed partial result, S, and a k character string segment of A, a
multiplicand, the segments progressing from a first string segment A0 to a higher string
segment Am-1, wherein each iterative result is ⊕ summated into a next temporary result,
wherein at least first emitted characters of said iterations are zeroes, the method
comprising the steps of:

⊕ summating into an ⊕ accumulation device, in turn one or two of a plurality of
multiplicand values, during each one of a plurality of phases of the iterative ⊗
multiplication process, and in turn receiving as multipliers, inputs from:

a B register,

an "on the fly" anticipating value, Y0, operable as a multiplier to force first
emitting right-hand zero output characters in each iteration, and

an N register;

generating a binary string operative to be a multiplier during a first multiplication
phase and operative to be a multiplicand in a second multiplication phase;

obtaining multiplicand values suitable for switching into the ⊕ accumulation
device for the first multiplication phase consisting firstly of a zero value, secondly a
value, Ai, which is a k character string segment of a multiplicand, A, and thirdly a value
N0, being the first emitted k characters of the modulus, N;

obtaining a temporary result value, S, resulting from a previous iteration, and
⊕ summating said temporary result value, S, with a present result value emanating from
the ⊕ accumulation device, to generate a partial result for a next in turn iteration;

obtaining multiplicand values for a second multiplication phase comprising firstly
a zero value, secondly an Ai operand, remaining in place from the first phase, and thirdly
a Y0 value having been anticipated in the first phase;

utilizing multiplier values obtained in the first phase, said values being firstly an
emitted string, B0, and multiplying said string concurrently ⊗ with a second ⊗ multiplier
value consisting of the anticipated Y0 string as it is simultaneously loaded character by
character whilst being generated into a preload multiplicand buffer for the second phase;

obtaining two multiplier values during the second multiplication phase, said
values being left hand n-k character values from the B operand, designated B̄, and the
left hand n-k characters of the N modulus, designated N̄, respectively; and

in a last multiplication phase transferring a left hand segment of a result value
remaining in the ⊕ accumulation device into a result register.

A preferred embodiment provides the additional step of computing J0
by resetting both A and B to zero and setting S0 = 1.

Preferred embodiments of the invention described herein provide a modular
computational operator for public key cryptographic applications on portable Smart
Cards, typically identical in shape and size to the popular magnetic stripe credit and
bank cards. Similar Smart Cards as per applicant's technology of US Patent 5,513,133
and 5,742,530, and applicant's above-mentioned pending applications are being used in
the new generation of public key cryptographic devices for controlling access to
computers, databases, and critical installations; to regulate and secure data flow in
commercial, military and domestic transactions; to decrypt scrambled pay television
programs, etc. and as terminal devices for similar applications. Typically, these devices
are also incorporated in computer and fax terminals, door locks, vending machines, etc.

The preferred architecture is of an apparatus operative to be integrated to a 
multiplicity of microcontroller and digital signal processing devices, and also reduced 
instruction computational designs while the apparatus operates in parallel with the host 
processing unit. 

This embodiment preferably uses only one multiplying device which inherently
serves the function of two multiplying devices, basically similar to the architecture
described in applicant's US Patent No. 5,513,133 and further enhanced in U.S. Patent
application 09/050,958 and PCT application PCT/IL98/00148. Using present
conventional microelectronic technologies, the apparatus of the present invention may
be integrated with a controlling unit with memories onto a 4 by 4.5 by 0.2 mm
microelectronic circuit.

The main difference between hardware implementations in the two fields is that
polynomial based additions and subtractions are simple XOR logic operations, without
carry signals propagating from LS to MS. Consequently, there is no interaction between
adjacent cells in the hardware implementation, and subtraction and addition are identical
procedures. The earliest public notice that the authors are aware of was a short lecture
by Marco Bucci of the Fondazione Ugo Bordoni, at the Eurocrypt Conference Rump
Session in Perugia, Italy, in 1994.

Applicant's previous apparati were typically prepared to efficiently compute
elliptic curve cryptographic protocols in the GF(p) field. In this invention we note that
as there is no interaction between adjacent binary bits in the polynomial field,
computations can be processed efficiently, simultaneously performing reduction and
multiplication on a superscalar multiplication device. Multiplication is preferably
performed as one might do with pencil and paper, starting from the most significant
partial products. Reduction is preferably performed by adding as many moduli as are
necessary to reset ones to zeroes. As there is no carry out in these additions, our results
are automatically modularly reduced. In this invention polynomial computations are
preferably performed using the same architecture, wherein the operands may be fed in
MS characters first, and wherein all internal carry signals may be forced to zero.
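
The MS-first procedure can be sketched for single bit characters as follows. This is a
minimal software model under stated assumptions, not the hardware: polynomials over
GF(2) are held as Python integers, the multiplicand a is assumed to be already reduced
(degree below that of n), and the name gf2_mulmod_msb_first is hypothetical.

```python
def gf2_mulmod_msb_first(a, b, n):
    """Return a*b mod n over GF(2), scanning the multiplier b MS bit first.

    a must have degree lower than n. Each step shifts the accumulator,
    XORs in the modulus when the top bit is a one (reduction), and XORs in
    the multiplicand when the current multiplier bit is a one.
    """
    deg = n.bit_length() - 1
    acc = 0
    for i in reversed(range(b.bit_length())):
        acc <<= 1
        if (acc >> deg) & 1:
            acc ^= n          # add one modulus to reset the overflow bit
        if (b >> i) & 1:
            acc ^= a          # add the partial product, no carries
    return acc
```

Because the accumulator never grows past the degree of the modulus, the value left at
the end is already the reduced product and needs no Montgomery style correction.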

In a preferred embodiment of the present invention, the architecture has been
extended to allow for a potentially faster progression, in that serial multipliers and
results are now l bit characters wide. This has somewhat complicated the anticipation
process (Y0), in that for single bit wide buses, the inverse of an odd number modulo a
power of 2 is also an odd number, and the least significant bit of the J0 multiplicand
was always a one. For both number fields, however, the reduction process is identical,
assuming the switch out of carries, if we remember that our only aim is to output a k
character zero string, and we regard the function only as a zero forcing coefficient.

The present invention also seeks to provide an architecture for a digital device
which is a peripheral to a conventional digital processor, with computational, logical
and architectural novel features relative to the processes described in US Patent
5,513,133.

A concurrent process and a hardware architecture are provided, to perform
modular exponentiation without division preferably with the same number of operations
as are typically performed with a classic multiplication/division device, wherein a
classic device typically performs both a large scale multiplication and a division on each
operation. A particular feature of a preferred embodiment of the present invention is the
concurrency of larger scale anticipatory zero forcing functions, the extension of number
fields, and the ability to integrate this type of unit for safe communications.

The advantages realized by a preferred embodiment of this invention result from a
synchronized sequence of serial processes. These processes are merged to
simultaneously (in parallel) achieve three multiplication operations on n character
operands, using one multiplexed k character serial/parallel multiplier in n effective clock
cycles, where the left hand final k characters of the result reside in the output buffer of
the multiplication device. This procedure accomplishes the equivalent of three
multiplication computations in both fields, as described by Montgomery, for the prime
number field.
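
As a point of reference, the three multiplications of a Montgomery product, and their
use in a division-free exponentiation, can be sketched in software as below. This is a
word-level illustration under assumptions made here: the whole operand is treated as a
single k bit segment, and the names mont_product and mont_exp are hypothetical. The
apparatus merges the three multiplications into one serial pass, which this sketch does
not attempt to model.

```python
def mont_product(a, b, n, k, j):
    """Return a*b*2^(-k) mod n for odd n < 2^k and a, b < n."""
    mask = (1 << k) - 1
    p = a * b                        # multiplication 1: A times B
    y = ((p & mask) * j) & mask      # multiplication 2: zero-forcing value
    t = (p + y * n) >> k             # multiplication 3: Y times N, then shift
    return t - n if t >= n else t    # single conditional subtraction

def mont_exp(base, exponent, n):
    """Return base**exponent mod n for odd n, using no division in the loop."""
    k = n.bit_length()
    r = 1 << k
    j = (-pow(n, -1, r)) % r         # negative inverse of n mod 2^k
    r2 = (r * r) % n                 # precomputed once per modulus
    x = mont_product(base % n, r2, n, k, j)   # base in Montgomery form
    acc = mont_product(1, r2, n, k, j)        # one in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_product(acc, acc, n, k, j)       # squaring
        if bit == "1":
            acc = mont_product(acc, x, n, k, j)     # multiplication
    return mont_product(acc, 1, n, k, j)      # remove the parasitic factor
```

The right shift by k bits replaces the division of a classic multiply and divide
device; the only precomputation per modulus is j and r2.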

By synchronizing loading of operands into the MAP and on the fly detecting
values of operands, and on the fly preloading and simultaneous addition of next to be
used operands, the apparatus is operative to execute computations in a deterministic
fashion. All multiplications and exponentiations are executed in a predetermined
number of clock cycles. Additional circuitry is preferably added which, on the fly,
preloads three first k character variables for a next iteration squaring sequence. A
detection device is preferably provided where only two of the three operands are chosen
as next iteration multiplicands, eliminating k effective clock cycle wait states.
Conditional branches are replaced with local detection and compensation devices,
thereby providing a basis for a simple control mechanism. The basic operations herein
described may typically be executed in deterministic time using a device described in
US Patent 5,513,133 to Gressel et al or devices such as those by STMicroelectronics in
Rousset, France, under the trade name ST19-CF58.

An apparatus according to the above-described embodiments has particularly lean
demands on external volatile memory for most operations, as operands are loaded into
and stored in the device for the total length of the operation. The apparatus preferably
exploits the CPU onto which it is appended, to execute simple loads and unloads, and
sequencing of commands to the apparatus, whilst the MAP performs large number
computations. Large numbers presently being implemented on smart card applications
range from 128 bit to 2048 bit natural applications. The exponentiation processing time
is virtually independent of the CPU which controls it. In practice, architectural changes
are typically unnecessary when appending the apparatus to any CPU. The hardware
device is self-contained, and is preferably appended to any CPU bus.

In general, the present invention also relates to arithmetic processing of large
integers. These large numbers are typically in the natural field of (non-negative) integers
or in the Galois field of prime numbers, GF(p), and also of composite prime moduli.
More specifically, a preferred embodiment of the present invention seeks to provide a
device that can implement modular exponentiation of large numbers. Such a device is
suitable for performing the operations of Public Key Cryptographic authentication and
encryption protocols, which, in the prime number field, work over increasingly large
operands and which cannot be executed efficiently with present generation modular
arithmetic coprocessors. Furthermore they cannot be executed securely in software
implementations. The same general architecture is used in elliptic curve
implementations for shorter operands, and here polynomial arithmetic may
advantageously be used in order to get the right answer the first time, without the burden
of the parasitic 2^-n factor which is discussed at length in the incorporated documents.

The architecture may offer a modular implementation of large operand integer
arithmetic, while allowing for normal and smaller operand arithmetic simply by
widening the serial single character bus, i.e., use of a larger radix.

For modular multiplication in the prime and composite field of odd numbers, A 
and B are defined as the multiplicand and the multiplier, respectively, and N is defined 
as the modulus in modular arithmetic. N is typically larger than A or B. N also denotes 
the register where the value of the modulus is stored. N may, in some instances, be 
smaller than A. A, B, and N are typically n characters long, where characters are typically 
one to 8 bits long. k is the number of l bit characters in the size of the group defined by 
the size (number of cells) of the multiplying device. 

In the prime field, ≡, or in some instances =, is used to denote congruence of 
modular numbers, for example 16 ≡ 2 mod 7. 16 is termed "congruent" to 2 modulo 7 as 
2 is the remainder when 16 is divided by 7. When Y mod N ≡ X mod N, both Y and X 
may be larger than N; however, for positive X and Y, the remainders are identical. Note 
also that the congruence of a negative integer Y is Y + u·N, where N is the modulus, 
and if the congruence of Y is to be less than N, u is the smallest integer which gives a 
positive result. 

In GF(2^q) congruence is much simpler, as addition and subtraction are identical, 
and the computations of embodiments of the present invention preferably never leave a 
substantial overflow. For N = 1101 and A = 1001, as the left hand MS bit of A is 1, we 
must reduce ("subtract") N from A by using modulo 2 arithmetic. A XOR N = 1001 
XOR 1101 = 0100. 
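
The carry-free reduction above can be sketched in a few lines of Python. This is an illustration only; the function name `gf2_reduce_once` is hypothetical and does not appear in the specification.

```python
def gf2_reduce_once(a: int, n: int) -> int:
    """One reduction step in GF(2^q): adding and subtracting are both XOR."""
    return a ^ n

A = 0b1001  # value whose MS bit is set
N = 0b1101  # polynomial modulus

reduced = gf2_reduce_once(A, N)
print(format(reduced, "04b"))
```

No carry or borrow propagates between bit positions, which is why the same step serves as both addition and subtraction.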

The Yen symbol, ¥, is used hereinbelow to denote congruence in a limited sense, 
especially useful in GF(p). During the processes described herein, a value is often either 
the desired value, or equal to the desired value plus the modulus, for example in X ¥ 2 
mod 7, X can be equal to 2 or 9. X is defined to have limited congruence to 2 mod 7. 
When the Yen symbol is used hereinbelow as a superscript, as in B^¥, then 0 ≤ B^¥ < 2N, 
or stated differently, B^¥ is either equal to the smallest positive B which is congruent to 
B^¥, or is equal to the smallest positive congruent B plus N, the modulus. 
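
As a hedged illustration of limited congruence, the sketch below maps a value known to lie in the range 0 to 2N-1 onto its fully reduced residue with a single conditional subtraction. The helper name `normalize_limited` is an assumption made for this example.

```python
def normalize_limited(x: int, n: int) -> int:
    """Reduce x, assumed to satisfy 0 <= x < 2n, with one conditional subtraction."""
    if not 0 <= x < 2 * n:
        raise ValueError("x is not limited congruent for this modulus")
    return x - n if x >= n else x

# Both 2 and 9 have limited congruence to 2 mod 7.
print(normalize_limited(2, 7), normalize_limited(9, 7))
```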

When X = A mod N, X is defined as the remainder of A divided by N; 
e.g., 3 = 45 mod 7, and much simpler in GF(2^q): 1111 mod 1001 = 0110. 
In number theory, the modular multiplicative inverse of X is written as X^-1, which 
is defined by X·X^-1 mod N = 1. If X = 3, and N = 13, then X^-1 = 9, i.e., the remainder of 
3·9 divided by 13 is 1 in GF(p). 

For both number fields, we typically choose to compute the multiplicative inverse 
of A using the exponential function, A^-1 mod q ≡ A^(q-2) mod q. 
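
A minimal sketch of inversion by exponentiation in GF(p), assuming the modulus q is prime so that Fermat's little theorem applies. The function name is hypothetical, and the polynomial field case is not covered here.

```python
def inverse_by_exponentiation(a: int, q: int) -> int:
    """Multiplicative inverse of a modulo a prime q, computed as a^(q-2) mod q."""
    return pow(a, q - 2, q)

x_inv = inverse_by_exponentiation(3, 13)
print(x_inv, (3 * x_inv) % 13)
```

Using exponentiation lets the same modular multiplier that performs the main computation also produce inverses, at the cost of roughly one exponentiation per inverse.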

The acronyms MS and LS are used to signify "most significant" and "least 
significant", respectively, when referencing bits, characters, and full operand values, as 
is conventional in digital nomenclature, but in the reversed mode polynomial base, 
operands are loaded MS data first and LS last, such that the bit order of the data word is 
reversed when loaded. 

Throughout this specification N designates both the value N, and the name of the 
shift register which stores N. An asterisk superscript on a value denotes that the value, 
as stands, is potentially incomplete or subject to change. A is the value of the number 
which is to be exponentiated, and n is the bit length of the N operand. After initialization, 
when A is "Montgomery P field normalized" to A* (A* = 2^n·A, explained in P1), A* and 
N are typically constant values throughout the intermediate step in the exponentiation. In 
GF(2^q) computations where computations might be performed with the normal 
unreversed positioning of bits we would be bound by this same protocol. However, 
using the reversed format, our computations generate most significant zeroes, which are 
disregarded, and do not represent a multiplication shift, as there is no carry out. 

During a first iteration, after initialization of an exponentiation, B is equal to A*. B 
is also the name of the register wherein the accumulated value that finally equals the 
desired result of exponentiation resides. S^¥ or S* designates a temporary value; and S 
designates the register or registers in which all but the single MS bit of a GF(p) S is 
stored. (S* concatenated with this MS bit is identical to S.) S(i-1) denotes the value of S 
at the outset of the i'th iteration. In these polynomial computations there is no need to 
perform modular reduction on S. 

Montgomery multiplication of X and Y in the prime number field is actually 
(X·Y·2^-n) mod N, where n is typically the number of characters in a modulus. This is 
written P(A·B)N, and denotes MM or multiplication in the P field. In the context of 
Montgomery mathematics, we refer to multiplication and squaring in the P field as 
multiplication and squaring operations. 
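
The definition above can be checked numerically. The sketch below computes the P field product directly from its definition, assuming one bit characters so that n counts bits, and using Python 3.8 or later for `pow` with a negative exponent. The function and variable names are illustrative, not taken from the specification.

```python
def montgomery_product(x: int, y: int, modulus: int, n: int) -> int:
    """P field product by definition: (x * y * 2^-n) mod modulus, modulus odd."""
    two_pow_minus_n = pow(2, -n, modulus)  # inverse of 2^n modulo the modulus
    return (x * y * two_pow_minus_n) % modulus

N, n = 13, 4
a, b = 7, 5
a_star = (a << n) % N  # Montgomery normalized operands, A* = 2^n * A mod N
b_star = (b << n) % N

p = montgomery_product(a_star, b_star, N, n)   # equals 2^n * a * b mod N
plain = montgomery_product(p, 1, N, n)         # multiplying by 1 removes the 2^n factor
print(p, plain)
```

Each P field multiplication introduces one factor of 2^-n, which is why operands are pre-multiplied by 2^n before an exponentiation and the result is retrieved with a final multiplication by 1.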

We may redefine this innovative extension of Montgomery arithmetic in GF(2^q) to 
mean a reversed format data order, wherein MS zero forcing does not change 
congruence, or initiate a burdensome parasitic factor. We may thus introduce a new set 
of symbols to accommodate the thus described arithmetic extension, and to use wider 
serial multiplier buses. Such an architecture is potentially vital for enabling a natural 
integer superscalar multiplier, which may accept 32 bit multiplicands and 4 bit 
multipliers, into an apparatus that can perform modular arithmetic multiplications and 
reductions simultaneously. 

Symbols in the Serial - Parallel Super Scalar Modular Multiplier Enhancement 

l         Number of bits in a character (digit). 
r         Radix of multiplier character, r = 2^l. 
n         Size of operands (multiplier, multiplicand and modulus) in characters. 
k         Length of serial-parallel multiplier in characters. 
m         Number of interleaved slices of multiplicand, m = (n/k). 
S_i       Result of i'th iteration; 0 ≤ i ≤ m-1; S_0 = 0. 
(S_i)_0   The right hand character of the i'th result, after disregarding the first right 
          hand zeroes. 
S̄_i       The left hand n-k characters of the i'th result. 
S_ij      j'th character of S_i. 
A         Parallel multiplicand, consists of m·k characters. 
A_i       The i'th k character slice of A. 
A_ij      The j'th character of A_i. 
B         Serial multiplier. 
B_0       First right hand k characters of B. 
B̄         Last left hand (n-k) characters of B. 
B_0j      j'th character of B_0. 
B_j       j'th character of B. 
N         Modulus operand. (Often denotes both the operand and the register.) 
N_0       The Right Hand k characters of N. 
          (LS characters in GF(p); MS characters in GF(2^q)) 
N̄         (n-k) Left Hand characters of N. 
          (MS characters in GF(p); LS characters in GF(2^q)) 
N_0j      j'th character of N_0. 
N_j       j'th character of N. 
Y_0       Zero forcing variable required for both Montgomery GF(p) and GF(2^q) 
          multiplication. Y_0 is k characters long. 
Y_0j      j'th character of Y_0. 
J_00      Zero forcing character function of the modulus, N, for "on the fly" finite field 
          multiplication and reduction. 
Carry_t   t'th internal carry character of radix r serial-parallel multiplier. 
Carry_s   Radix r carry of output serial adder for GF(p) computations. 
Sum_t     t'th internal sum character of radix r serial-parallel multiplier. 
LS        Least Significant. 
MS        Most Significant. 
||        Concatenation, e.g. A = 110, B = 1101; A || B = 1101101. 
Right Hand  The Least Significant portion of all GF(p) computation data blocks 
          and the Most Significant portion of the reversed GF(2^q) format; conversely for 
          the Left Hand definition. 
GF(p)     Galois Field, strictly speaking finite fields over prime numbers where we 
          also use composite integers that allow for addition, subtraction, multiplication and 
          division. 
GF(2^q)   Galois Fields using modulo 2 modular arithmetic. 
⊕         A generic operator or device which may be switched to add or subtract 
          integers with or without carries as befits the number system. 
⊗         A generic operator or device which may be switched to multiplication over 
          GF(p) or multiplication over GF(2^q). 
$         The number field switch. 
          $ = 1, the switch is operative to enable all carry in/outputs for GF(p) 
          computations; 
          $ = 0, the switch is operative to disable all carry in/outputs for GF(2^q) 
          computations. 
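
The two generic operators can be modeled in software as ordinary integer arithmetic when carries are enabled and as carry-less (XOR based) arithmetic when they are disabled. The sketch below is an assumption-laden model for illustration; the names `generic_add`, `generic_mul` and `carryless_mul` are hypothetical, and the hardware described here is not structured this way.

```python
def carryless_mul(a: int, b: int) -> int:
    """Multiply two polynomials over GF(2) held as bit strings (no carries)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # add the shifted multiplicand without carry
        a <<= 1
        b >>= 1
    return result

def generic_add(x: int, y: int, carries_enabled: bool) -> int:
    """Integer addition for GF(p) operation, XOR for GF(2^q) operation."""
    return x + y if carries_enabled else x ^ y

def generic_mul(x: int, y: int, carries_enabled: bool) -> int:
    """Integer product for GF(p) operation, carry-less product for GF(2^q)."""
    return x * y if carries_enabled else carryless_mul(x, y)

print(generic_add(0b110, 0b011, True), generic_add(0b110, 0b011, False))
print(generic_mul(0b11, 0b11, True), generic_mul(0b11, 0b11, False))
```

With carries disabled, (x + 1)·(x + 1) yields x^2 + 1, since the two middle terms cancel modulo 2.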

A Serial - Parallel Super Scalar Montgomery Multiplier computes a Montgomery 
modular product in three phases, wherein the last phase may be a single clock dump of 
the left hand segment in the CSA (MS for normal multiplications, LS for reverse mode 
polynomial computations). 



S_0 = 0 : partial product initially zero. 
For i = 0 to m - 1 : at each interleave - 

The process of the first phase is the generic interaction of the operands 

(B_0 and Y_0 are serially, character by character, fed into the multiplier; A_i and N_0 are 
parallel operands) 

The first phase process implements a ⊕ summation of the two superscalar 
products with the right hand character of the previous result. A k zero character string is 
emitted from the multiplying device, and is disregarded; and a partial result is preferably 
left in the device buffer which is summated into the second phase result. 

(B_0 and Y_0 are serially, character by character, fed into the multiplier; 
A_i and N_0 are parallel operands). 

The first phase result is R_0 concatenated with the serial output being right hand 
zeroes. 

The process of the second phase is the generic interaction of the operands: 
R ⊕ S̄_(i-1) ⊕ A_i ⊗ B̄ ⊕ Y_0 ⊗ N̄ 

(B̄ and N̄ are serially fed into the multiplier; A_i and Y_0 are parallel operands) 

At the end of the second phase, the left hand slice of S_i preferably remains in the 
Sum buffer of the Accumulator, ready to be transferred, and the right hand slice has 
emanated from the device, typically into an S register. Note that multiplication in the 
prime number field has been performed in a conventional carry save summation method. 
Multiplication in the GF(2^q) reversed format mode has started at the MS characters. The 
Y_0 function has anticipated when a modulus value must be "added" into the 
accumulator. Except for the disabled carry bits in the device, the process itself is 
identical for the two number systems. 

How the Y_0j zero-forcing vector may be derived in finite fields. 
Compute: J_00 = -N_00^-1 mod r. 

This single character of the function can be hardwire implemented with 
random logic, with simple circuitry, or with a simple look up table. Remember there are 
only 2^(l-1) different values that must be derived in a look up table. In GF(p) prime 
numbers are odd, and the inverse of an odd number mod 2^l must always be odd. In the 
reverse mode format, the polynomial modulus must be right justified. Therefore, if l = 1 
the J_0 multiplier is implicitly 1. 
The result of a character output forced by the Y_0 function is always 0. 

0 = (S_0 ⊕ A_i0 ⊗ B_0j ⊕ Y_0j ⊗ N_00) mod r 

Solving the above equation for Y_0j, we see that J_00 is the negative inverse of the 
right hand character of N_0. 

Y_0j = (-N_00^-1 ⊗ (S_0 ⊕ A_i0 ⊗ B_0j)) mod r 

Y_0j = (J_00 ⊗ (S_0 ⊕ A_i0 ⊗ B_0j)) mod r 
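
For the GF(p) case only, the zero-forcing relation can be sketched as follows. The value `x` stands for the summed character before the modulus is added; the function names and the sample values are assumptions chosen for illustration.

```python
def zero_forcing_constant(n00: int, r: int) -> int:
    """Negative inverse of the odd right hand modulus character, modulo r = 2^l."""
    return (-pow(n00, -1, r)) % r

def zero_forcing_character(x: int, n00: int, r: int) -> int:
    """Multiplier of the modulus that forces (x + y * n00) mod r to zero."""
    return (zero_forcing_constant(n00, r) * x) % r

r = 16       # radix for l = 4 bit characters
n00 = 0b1011 # right hand character of an odd modulus
x = 7        # summed character awaiting zero forcing

y = zero_forcing_character(x, n00, r)
print(y, (x + y * n00) % r)
```

Because the constant depends only on the right hand character of the modulus, it can be derived once per modulus and reused for every character of the multiplication.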
Formalizing the extended Super scalar multiplication method for both number 
fields: 

Set S_0 = 0 

For i = 0 to m-1 (Interleave iteration) 

First phase: 
For j = 0 to k-1 

Y_0j = (J_00 ⊗ (S_0 ⊕ A_i0 ⊗ B_0j)) mod r 
S_j = (S_j-1 ⊕ A_i ⊗ B_0j ⊕ Y_0j ⊗ N_0) / r 

Second phase: 
For j = k to n-1 

S_j = S_j ⊕ A_i ⊗ B_j ⊕ Y_0 ⊗ N_j 
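
The following is a simplified software sketch of the character-serial idea for GF(p) only. It collapses the interleave to a single slice (the whole multiplicand is treated as the parallel operand) and applies the zero-forcing character at every step, so it illustrates the arithmetic rather than the two-phase apparatus. All names are hypothetical.

```python
def montgomery_multiply_serial(a: int, b: int, modulus: int, l: int, n_chars: int) -> int:
    """Return s with s = a * b * r^-n_chars (mod modulus), r = 2^l, and 0 <= s < 2 * modulus.

    Assumes an odd modulus, a < modulus, and b < r^n_chars.
    """
    r = 1 << l
    j0 = (-pow(modulus % r, -1, r)) % r      # zero-forcing constant
    s = 0
    for j in range(n_chars):
        b_j = (b >> (l * j)) & (r - 1)       # next serial multiplier character
        partial = s + a * b_j
        y = (j0 * (partial % r)) % r         # zero-forcing character
        s = (partial + y * modulus) >> l     # right hand character is zero; drop it
    return s

N, l, n_chars = 239, 4, 2
a, b = 100, 200
s = montgomery_multiply_serial(a, b, N, l, n_chars)
expected = (a * b * pow(1 << (l * n_chars), -1, N)) % N
print(s % N == expected)
```

The returned value is only limited congruent to the product, so a caller that needs the fully reduced residue would follow with one conditional subtraction of the modulus.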
Implementation of the above algorithm with a character based serial-parallel 
multiplier is a simple extension of the above protocol: 

Set S_0 = 0 

For i = 0 to m-1 (Interleaved loop) 

First phase: 
For j = 0 to k-1 

Y_0j = (J_00 ⊗ (S_j0 ⊕ A_i0 ⊗ B_0j ⊕ $·Carry_0 ⊕ Sum_1 ⊕ Quotient((S_j0 ⊕ Sum_0), r))) mod r 
For t = 0 to k-1 (Whole loop with 1 clock pulse) 

Sum_t = (Sum_t+1 ⊕ $·Carry_t ⊕ A_it ⊗ B_0j ⊕ Y_0j ⊗ N_0t) mod r 
Carry_t = Quotient((Sum_t+1 ⊕ Carry_t ⊕ A_it ⊗ B_0j ⊕ Y_0j ⊗ N_0t), r) 
(Output of multiplier in this stage is '0's) 

Second phase: 
Main part 

Carry_s = 0 
For j = k to n-1 

For t = 0 to k-1 (Whole loop with 1 clock pulse) 

Sum_t = (Sum_t+1 ⊕ $·Carry_t ⊕ A_it ⊗ B_j ⊕ Y_0t ⊗ N_j) mod r 
Carry_t = Quotient((Sum_t+1 ⊕ Carry_t ⊕ A_it ⊗ B_j ⊕ Y_0t ⊗ N_j), r) 

S_i,j-k = (S_i-1,j ⊕ Sum_0 ⊕ $·Carry_s) mod r 
Carry_s = Quotient((S_i-1,j ⊕ Sum_0 ⊕ Carry_s), r) 

Flushing of the multiplier 
For j = n to n+k-1 

For t = 0 to k-1 (Whole loop with 1 clock pulse) 
Sum_t = (Sum_t+1 ⊕ Carry_t) mod r 
Carry_t = Quotient((Sum_t+1 ⊕ Carry_t), r) 

S_i,j-k = (S_i-1,j ⊕ Sum_0 ⊕ $·Carry_s) mod r 
Carry_s = Quotient((S_i-1,j ⊕ Sum_0 ⊕ Carry_s), r) 

For a formal explanation with examples of the particular case where [...] 
GF(p) field, see "P1". 

The above describes a microelectronic method and apparatus for p

interleaved finite field ⊗ modular multiplication of integers A and B op

generate an output stream of A times B modulus N having n characters in the

operand register wherein n is larger than k, wherein the ⊗ multiplication

performed in iterations, wherein at each interleaved iteration with operands in

⊗ multiplying device, the operands consisting of N, the modulus, B, a mu
character string segment of A, a 
A_0 string segment to the A_m-1 string 
ited into a next in turn S, temporary 

of iterative results are zeroes, the 

iracter registers feeding the multiplying 
preferably (A), configured to load the 

output operands, and may respectively be 
le and a modulus, N. 

rably operative to ⊕ summate into the 
a plurality of multiplicand values, during 
ess, and in turn to receive as multipliers, 
"on the fly" anticipating value, and a Y_0 
e first emitting right-hand zero output 
om the modulus, N, register. 
preferably operative to receive, in turn, 
es, and in turn, also a multiplicand zero 

erative to generate a binary string operative 
aeration and is operative to be a multiplicand 
ication. 

hed into the ⊕ accumulation device for the 
st zero value, a second value, A_i, which is a k 
nd, A, and a third value N_0, being the first 
The N_0 value may typically be switched in at 
ourth preload buffer as in Fig. 6. Then when a k 
e is serially summated with the N_0 value and 

ingle k character modulus, then there is no need 
result value, S. If the operand is 2k slices or 
longer, then the manipulations must be iterative, with progressing A_i slices, 
operations slices of B are typically snared from the B stream on the 
into the A_i preload buffer. 

At the first iteration of a multiplication procedure, the temporary resu
Subsequent temporary results from previous iterations, are ope
⊕ summated with the value emanating from the ⊕ accumulation device 
partial result for the next iteration in turn. 

The multiplicand values to be input, in turn, into the ⊕ accumulatio
the second phase are a first zero value, which is a pseudo register value, 
operand, remaining in place from the first phase, and a third Y_0 value 
anticipated in the first phase operative to continue multiplying the remainin
of the N modulus. 

The multiplier values input into the multiplying device in the first pha
emitting string, B_0, the first emitting string segment of the B operand, con
multiplying with the second ⊗ multiplier value consisting of the anticipate
which is simultaneously loaded character by character as it is generated int
multiplicand buffer for the second phase. 

The two multiplier values input into the apparatus during the second 
preferably the left hand n-k character values from the B operand, designated 
left hand n-k characters of the N modulus, designated N̄, respectively. 

The third phase is a flush out of the device operative to transfer the 
segment of a result value remaining in the ⊕ accumulation device. This may 
single clock data dump, or a simple serial unload, driven by zero characters 
multiplier inputs. 

A particularly preferred embodiment of the present invention comprises 
mode multiplication in the GF(2^q). Because of the lack of interaction betw
cells in this arithmetic, it is possible to perform multiplication and reductio
from the MS end of the product, thereby having a product that is the righ
without a burdensome parasite, caused by disregarded zeroes which are tanta
performing a right shift in conventional Montgomery multiplication. 
A further preferred embodiment that allows for automatic zero forcing is an 
extension of the Y_0 function of patent application P1. P1 describes a device wherein only 
one bit is anticipated at a time. There the J_00 bit, only, multiplies the single bit XORed 
values. Both the multiplicative inverse of an odd number and its negative value produce 
an odd number. This saves implementing a look up table or a random logic circuit to 
compute the J_0 value for l = 1. Note, J_0 is a different quantity in non-alike number 
systems. We have shown in this extension how a Y_0 value can be derived, for both 
relevant number fields. 

The following describes the elements of the circuitry operative to anticipate the Y_0 
value using first emitting values of the multiplicand, and present inputs of the B 
multiplier, carry out values from the ⊕ accumulation device, ⊕ summation values from 
the ⊕ accumulation device, the present values from the previously computed partial 
result, and carry out values from the ⊕ adder which ⊕ summates the result from the 
⊕ accumulation device with the previous partial result. 

Stated differently, the six values that are operative to control the zero forcing 
function are: 

i the l bit S_out bits of the result of the l bit by l bit mod 2^l ⊗ multiplication of the 
right-hand character of the A_i register times the character of the B Stream, 
A_0·B_d mod 2^l; 

ii the first emitting carry out character from the ⊕ accumulation device, 
$(CO_0); 

iii the l bit S_out character from the second from the right-hand character emitting 
cell of the ⊕ accumulation device, SO_1; 

iv the next in turn character value from the S stream, S_d; 

v the l bit carry out character from the Z output full adder, $(CO_z); 

vi the l bit J_0 value, which is the negative multiplicative inverse of the 
right-hand character in the N_0 modulus multiplicand register; 

wherein values A_0·B_d mod 2^l, $(CO_0), SO_1, S_d are ⊕ added character to 
character together and "on the fly" ⊗ multiplied by the J_0 character to output a valid Y_0 
zero-forcing anticipatory character to force an l bit egressing character string of zeroes. 
Just as in P1, in order to determine if an output must be modular reduced, a sensor 
is operative to compare the output result to N, the modulus, and the mechanism is itself 
operative to actuate a second subtracter on the output of the result register, thereby to 
output a modular reduced value which is limited congruent to the output result value, 
precluding the necessity to allot a second memory storage for a smaller result. 

The single ⊕ accumulation device, configured to perform multiplication, and an 
anticipating zero forcing mechanism together are operative to perform a series of 
interleaved ⊗ modular multiplications and squarings. The total device performs the 
equivalent of three integer multiplications, as in a conventional Montgomery method: J_0 
is a k character device multiplying the first k character summation of B_0·A_i and S_i, and 
finally using the Y_0 to multiply N. 

Whilst the SuperMAP computes the last iteration of a multiplication, the first slice 
of a next multiplication can be preloaded into a preload register buffer means on the fly. 
This value may be the result of a previous multiplication or a slice of a multiplicand 
residing in one of the register segments in the register bank of Fig. 1 or Fig. 5. 

The preloaded value, which is a ⊕ summation of two multiplicands, is 
⊕ summated into k slices of a register, only, for GF(2^q) computations. In GF(p) 
computations, provision must be made for an additional carry bit. 

Especially for very long moduli, buffers and registers adjacent to the SuperMAP 
typically have insufficient memory resources. Means for loading operands directly into 
preload buffers is provided, operative to store operands in the CPU's memory map. For 
reverse format multiplication, bit order of input words from the CPU are typically 
reversed in the Data In and Data Out devices. 

BRIEF DESCRIPTION OF THE DRAWINGS 
For a better understanding of the invention, and to show how the same may be 
carried into effect, reference will now be made, purely by way of example, to the 
accompanying drawings in which: 

Fig. 1 is a block diagram of the apparatus according to an embodiment of the 
invention where four main registers are depicted, and the serial data flow path to the 
operational unit and the input and output data path to the host CPU of Fig. 3 are shown; 

Fig. 2 is a block diagram of an embodiment of an operational unit operative to 
manipulate data from Fig. 1; 

Fig. 3 is a simplified block diagram of a preferred embodiment of a complete 
single chip, monolithic cryptocomputer, typically usable in smart cards; 

Fig. 4 is a simplified block diagram of a preferred embodiment of a complete 
single chip monolithic cryptocomputer wherein a data disable switch is operative to 
provide for accelerated unloading of data from the operational unit; 

Fig. 5 is a simplified block diagram of a data register bank, operative to generate 

Fig. 6 is a simplified block diagram of an operational unit, wherein the Y0 sensor 
is a device operative to force a zero first phase output; 

Fig. 7A is a simplified block diagram of the main computational part of Fig. 6; 

Fig. 7B is an event timing pointer diagram showing progressively the process 
leading to and including the first iteration of a squaring operation; 

Fig. 7C is a detailed event sequence to eliminate the "Next Montgomery Squaring" 
delays in the first iteration of a squaring sequence; 

Fig. 7D illustrates the timing of the computational output of a first iteration of a 
multiplication sequence, relating to Fig. 7A, 7B, and Fig. 7C; and 

Fig. 8 is a simplified schematic diagram showing how to generate the Y_0 
function for l = 4. 

DESCRIPTION OF PREFERRED EMBODIMENTS 
Figs. 1 - 2, taken together, form a simplified block diagram of a serial-parallel 
arithmetic logic unit (ALU) constructed and operative in accordance with a preferred 
embodiment of the present invention. An apparatus according to Figs. 1 - 2, preferably 
includes the following components: 

Single Multiplexers - Controlled Switching Elements M1 to M13 select one signal 
or character stream from a multiplicity of inputs of signals and direct the selected signal 
to given outputs. The multiplexers are marked, and are intrinsic parts of larger elements. 

M_K Multiplexer, 390, is an array of k+1 multiplexers, and chooses which of four 
k or k+1 character inputs are to be added into a CSA, 410. 
In addition there are provided four registers B (1000), Sa (130), Sb (180), and N 
(1005), which together are the four main serial registers in a preferred 
embodiment. The Sa register is conceptually and practically redundant, but can 
considerably accelerate very long number computations, and save volatile memory 
resources, especially in a case where the length of the modulus is 2·k·m characters long. 

Serial Adders and Serial Subtractors are logic elements that have two serial 
character inputs and one serial character output, and summate or perform subtraction on 
two long strings of characters. Components 90 and 480 are subtractors; 330 and 460 are 
serial adders. The propagation time from input to output is very small. Serial subtractor 
90 reduces B* to B if B* is larger than or equal to N. Serial subtractor 480 is used as 
part of a comparator component to detect if B* will be larger than or equal to N. A full 
adder 330 adds two character streams and feeds the result into a load buffer 340. The 
two character streams are the same values as those stored in load buffers 290 and 320 
and thus the result stored in load buffer 340 is equal to the sum of the values in load 
buffers 290 and 320. 

The Load Buffers R1, 290; R2, 320; and R3, 340, as referred to above, are 
serial-in parallel-out shift registers adapted to receive the three possible more than zero 
multiplicand combinations. 

Fast loaders and unloaders, 10 and 20, and 30 and 40, respectively, are devices to 
accelerate the data flow from the CPU controller. Typically these devices are DMA 
controlled. Devices 20 and 40 are for reversing the data word as necessary for reverse 
format GF(2^q) multiplications. 

Data In device, 50, is a parallel in serial out device, as the present ALU device is a 
serial fed systolic processor, and data is fed in, in parallel, and processed in serial. 

Data Out device 60 is a serial in parallel out device, for outputting results from the 
coprocessor. A quotient generator 120 (Fig. 1) generates a quotient character at 
each iteration of the dividing mechanism. 

Flush Signal generators 240, 250 and 260 provide flush signals for the Bd, S*d, 
and Nd registers respectively. The generators 240, 250 and 260 each preferably 
comprise a flush control signal, Bd flush, Sd flush and Nd flush respectively, ANDed with 
a data out signal. The flush signal generators are made to ensure that the last k+1 
characters can flush out the CSA, as the alternative would be a complicated 
character parallel output element to retrieve the MS k+1 characters of the accumulator. 

Latches L1, 360; L2, 370; and L3, 380; are made to receive the outputs from the 
load buffers 290, 320 and 340, thereby allowing the load buffers the temporal 
enablement to process the next phase of data before this data can be latched into L1, L2, 
and L3. 

There is also provided a Y0 Sensor, 430, which is a logic device that determines 
the number of times the modulus is accumulated, in order that a k character string of LS 
zeros will exit at Z in ⊗ multiplications. 

One character delay devices 100, 220 and 230 are inserted in the respective data 
streams to accommodate for synchronization problems between the data preparation 
devices in Fig. 1, and the data processing devices in Fig. 2. 

The k character delay, shift register, 470, preferably receives the Nd result after 
disregarding zero output strings of the synchronized N for the larger than N comparison. 

The Carry Save Accumulator 410 is almost identical to a serial/parallel multiplier, 
except that three different larger than zero values can be summated, instead of the single 
value usually latched onto the input of the s/p multiplier. When used in polynomial 
based computations "all carry dependent" functions are preferably disabled. 

A D-type flip-flop Insert Last Carry, 440, is used to insert an m·k+1'th character of 
the S stream as the S register is only m·k characters long. 

A borrow/overflow detector, 490, is connected to the output of the accumulator 
410 and to the output of adder 460. It can thus either detect if the accumulator result is 
larger than or equal to the modulus (from N), or if the m·k+1'th bit is a one. In poly based 
arithmetic it would detect the first significant result bit. 

The control mechanism is not depicted, but is preferably a set of cascaded 
counting devices, with switches set for systolic data flow. 

For modular multiplication in the prime and composite prime field of 
numbers, we define A and B to be the multiplicand and the multiplier, and N to be the 
modulus which is usually larger than A or B. N also denotes the register where the value 
of the modulus is stored. N may, in some instances, be smaller than A. We define A, B, 
and N as m·k = n character long operands. Each k character group will be called a 
segment, the size of the group defined by the size of the multiplying device. Then A, B, 
and N are each m segments long. For ease in following the step by step procedural 
explanations, assume that A, B, and N are 512 bits long (n = 512); assume that k is 64 
characters long because of the present cost effective length of such a multiplier and data 
manipulation speeds of simple CPUs. In addition, m = 8 is the number of segments in 
an operand and also the number of iterations in a squaring or multiplying loop with a 
512 bit operand. All operands are positive integers. More generally, A, B, N, n, k and m 
may assume any suitable values. 
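
The segmentation in the worked example (n = 512, k = 64, m = 8, with one bit characters) can be sketched as below. The helper names are hypothetical and serve only to show how an operand divides into m segments of k bits and recombines without loss.

```python
def split_segments(value: int, k_bits: int, m: int) -> list[int]:
    """Split an operand into m segments of k_bits each, LS segment first."""
    mask = (1 << k_bits) - 1
    return [(value >> (k_bits * i)) & mask for i in range(m)]

def join_segments(segments: list[int], k_bits: int) -> int:
    """Recombine segments produced by split_segments."""
    value = 0
    for i, segment in enumerate(segments):
        value |= segment << (k_bits * i)
    return value

n, k = 512, 64
m = n // k
operand = (1 << 511) | 0x1234_5678_9ABC_DEF1  # arbitrary odd 512 bit value
segments = split_segments(operand, k, m)
print(m, len(segments), join_segments(segments, k) == operand)
```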

In non-modular functions, the N and S registers can be used for temporary 
storage of other arithmetic operands. 

As discussed above, we use the symbol, ≡, to denote congruence of modular 
numbers, for example 16 ≡ 2 mod 7, and we say 16 is congruent to 2 modulo 7 as 2 is 
the remainder when 16 is divided by 7. When we write Y mod N ≡ X mod N, both Y 
and X may be larger than N; however, for positive X and Y, the remainders will be 
identical. Note also that the congruence of a negative integer Y is Y + u·N, where N is 
the modulus, and if the congruence of Y is to be less than N, u will be the smallest 
integer which will give a positive result. 

We use the symbol, ¥, to denote congruence in a more limited sense. During 
the processes described herein, a value is often either the desired value, or equal to the 
desired value plus the modulus. For example X¥ 2 mod 7. X can be equal to 2 or 9. 
We say X has limited congruence to 2 mod 7. 

When we write X = A mod N, we define X as the remainder of 
A divided by N; e.g., 3 = 45 mod 7. 

In number theory the modular multiplicative inverse is a basic concept. For 
example, the modular multiplicative inverse of X is written as X^-1, which is defined by 
X·X^-1 mod N = 1. If X = 3, and N = 13, then X^-1 = 9, i.e., the remainder of 
3·9 divided by 13 is 1. 

The acronyms MS and LS are used to signify most significant and least 
significant when referencing bits, characters, segments, and full operand values, as is 
conventional in digital nomenclature. 
As mentioned above, throughout this specification N designates both the 
value N, and the name of the shift register which contains N. An asterisk superscript on 
a value denotes that the value, as stands, is potentially incomplete or subject to change. 
A is the value of the number which is to be exponentiated, and n is the character length 
of the N operand. After initialization, when A is "Montgomery normalized" to A* 
(A* = 2^n·A, to be explained later), A* and N are constant values throughout the 
intermediate step in the exponentiation. During the first iteration, after initialization of 
an exponentiation, B is equal to A*. B is also the name of the register wherein the 
accumulated value which finally equals the desired result of exponentiation resides. 
S* designates a temporary value, and S, Sa and Sb designate, also, the register or 
registers in which all but the single MS bit of S is stored. (S* concatenated with this MS 
bit is identical to S.) S(i-1) denotes the value of S at the outset of the i'th iteration; 
S_0 denotes the LS segment of an S(i)'th value. 

We refer to the process (defined later) P(A·B)N as multiplication in the P 
field, or sometimes, simply, a multiplication operation. 

As we have used the standard structure of a serial/parallel multiplier as the 
basis for constructing a double acting serial parallel multiplier, we differentiate between 
the summating part of the multiplier, which is based on carry save accumulation (as 
opposed to a carry look ahead adder, or a ripple adder, the first of which is considerably 
more complicated and the second very slow), and call it a carry save adder or 
accumulator, and deal separately with the preloading mechanism and the multiplexer 
and latches, which allow us to simultaneously multiply A times B and C times D, 
summate both results, e.g., A·B + C·D, converting this accumulator into a very powerful 
engine. Additional logic is added to this multiplier in order to provide for an anticipated 
sense operation necessary for modular reduction and serial summation necessary to 
provide powerful modular arithmetic and ordinary integer arithmetic on very large 
numbers. 

Montgomery Modular Multiplication 
In a classic approach for computing a modular multiplication, A·B mod N, 
the remainder of the product A·B is computed by a division process. Implementing a 
conventional division of large operands is more difficult to perform than serial/parallel 
multiplications. 

Using Montgomery's modular reduction method, division is essentially 
replaced by multiplications using two precomputed constants. In the procedure 
demonstrated herein, there is only one precomputed constant, which is a function of the 
modulus. This constant is, or can be, computed using an ALU device according to the 
present invention. 

A simplified presentation of the Montgomery process, as is used in a device 
according to a preferred embodiment of the present invention, is now provided, 
followed by a complete preferred description. 

If we have an odd number (an LS bit one), e.g., 1010001 (= 81_10), we can 
always transform this odd number to an even number (a single LS bit of zero) by adding 
to it another fixing, compensating odd number, e.g., 1111 (= 15_10); as 1111 + 1010001 = 
1100000 (= 96_10). In this particular case, we have found a number that produced five LS 
zeros, because we knew in advance the whole string, 81, and could easily determine a 
binary number which we could add to 81, and would produce a new binary number that 
would have as many LS zeros as we might need. This fixing number must be odd, else it 
has no effect on the progressive LS characters of a result. 
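
The fixing idea above can be checked numerically. The following is a minimal Python sketch, not part of the described apparatus; the helper name `fix_for_ls_zeros` is hypothetical and is introduced only for illustration. It finds the number that must be added to a value so that a required count of LS bits becomes zero, and reproduces the 81 + 15 = 96 example. 

```python
def fix_for_ls_zeros(value: int, zeros: int) -> int:
    """Return the smallest non-negative number whose addition to `value`
    leaves `zeros` least significant zero bits in the sum."""
    return (-value) % (1 << zeros)


fix = fix_for_ls_zeros(0b1010001, 5)   # 81, with five LS zeros wanted
total = 0b1010001 + fix                # the fixed-up value
print(bin(fix), bin(total))            # the fix is odd because 81 is odd
```

Because the whole operand is known in advance here, the fix is found in one step; the serial process described next only needs to decide one bit of the fix per clock. 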

If our process is a clocked serial/parallel carry save process, where it is 
desired to have a continuous number of LS zeros, and wherein at each clock cycle we 
only have to fix the next bit, at each clock it is sufficient to add the fix, if the next bit 
would potentially be a one or not to add the fix if the potential bit were to be a zero. 
However, in order not to cause interbit overflows (double carries), this fix is preferably 
summated previously with the multiplicand, to be added into the accumulator when the 
relevant multiplier bit is one, and the Y Sense also detects a one. 

Now, as in modular arithmetic we are only interested in the remainder of a 
value divided by the modulus, we know that we can add the modulus any number of 
times to a value, and still have a value that would have the same remainder. This means 
that we can add Y·N = Σ y_i·r^i·N to any integer, and still have the same remainder; Y being 
the number of times we add in the modulus, N, to produce the required LS zeros. As 
described, the modulus that we add can only be odd. Methods exist wherein even 
moduli are defined as r^i times the odd number that results when i is the number of LS 
zeros in the even number. 

The Montgomery interleaved variations are aimed at reducing the storage 
space needed for numbers, and the cost effective size of the multipliers. This is 
especially useful when performing public key cryptographic functions where we are 
constantly multiplying one large integer, e.g., n = 1024 bit, by another large integer; a 
process that would ordinarily produce a double length 2048 bit integer. 

We can add in Ns (the modulus) enough times to A·B = X or A·B + S = X 
during the process of multiplication (or squaring) so that we will have a number, Z, that 
has n LS zeros, and, at most, n+1 MS characters. 

We can continue using such numbers, disregarding the LS n characters, if 
we remember that by disregarding these zeros, we have divided the desired result by r^n. 

When the LS n characters are disregarded, and we only use the most 
significant n (or n+1) characters, then we have effectively multiplied the result by r^(-n), the 
modular inverse of r^n. If we would subsequently re-multiply this result by r^n mod N (or 
r^n) we would obtain a value congruent to the desired result (having the same remainder) 
as A·B + S mod N. As is seen, using MM, the result is preferably multiplied by r^(2n) to 
overcome the r^(-n) parasitic factor reintroduced by the MM. 
Example: 

A·B + S mod N = (12·11 + 10) mod 13 = (1100·1011 + 1010)_2 mod 1101_2. 
l = 1, r = 2 

We will add in 2^i·N whenever a fix is necessary on one of the n LS bits. 

      B                1011 
    x A                1100 
  add S                1010 
  add A(0)·B           0000 
                       ----   sum of LS bit = 0; do not add N 
  add 2^0·(N·0)        0000 
  sum                  0101   -> 0 LS bit leaves carry save adder 

  add A(1)·B           0000 
                       ----   sum of LS bit = 1; add N 
  add 2^1·(N·1)        1101 
  sum                  1001   -> 0 LS bit leaves CS adder 

  add A(2)·B           1011 
                       ----   sum of LS bit = 0; do not add N 
  add 2^2·(N·0)        0000 
  sum                  1010   -> 0 LS bit leaves CS adder 

  add A(3)·B           1011 
                       ----   sum of LS bit = 1; add N 
  add 2^3·(N·1)        1101 
  sum                 10001   -> 0 LS bit leaves CS adder 

And the result is 10001 0000_2 mod 13 = 17·2^4 mod 13. 

As 17 is larger than 13 we subtract 13, and the result is: 
17·2^4 ≡ 4·2^4 mod 13. 

Formally, 2^(-n)·(A·B + S) mod N = 9·(12·11 + 10) mod 13 = 4. 

In Montgomery arithmetic we utilize only the MS non-zero result (4) and 
effectively remember that the real result has been divided by 2^n; n zeros having been 
forced onto the MM result. 

We have added in (8 + 2)·13 = 10·13, which effectively multiplied the result by 
2^n mod 13 = 3. In effect, had we used the superfluous zeros, we can say that we have 
performed A·B + Y·N + S = (12·11 + 10·13 + 10) in one process, which will be described 
below in respect of one preferred embodiment. 

Check: (12·11 + 10) mod 13 = 12; 4·3 = 12. 
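
The bit serial example above can be replayed in software. The following Python sketch is illustrative only and is not the circuit of the preferred embodiment; the function name `serial_montgomery` and its argument names are assumptions made for this illustration. At each clock it adds B when the multiplier bit is a one, adds the odd modulus when the bit about to leave the accumulator would be a one, and then shifts the zero LS bit out. 

```python
def serial_montgomery(a: int, b: int, s: int, n_mod: int, n_bits: int) -> int:
    """Bit-serial computation of (a*b + s) * 2**(-n_bits) mod n_mod.

    Assumes an odd modulus and b, s smaller than the modulus, so that a
    single final subtraction is sufficient.
    """
    if n_mod % 2 == 0:
        raise ValueError("the modulus must be odd")
    acc = s                        # the accumulator starts with S
    for i in range(n_bits):
        if (a >> i) & 1:           # multiplier bit A(i)
            acc += b               # add A(i)*B
        if acc & 1:                # sense: the next LS bit would be a one
            acc += n_mod           # add N so that a zero leaves instead
        acc >>= 1                  # the zero LS bit leaves the adder
    if acc >= n_mod:               # at most one subtraction of N
        acc -= n_mod
    return acc


result = serial_montgomery(12, 11, 10, 13, 4)
print(result)
```

With the operands of the example the accumulator holds 17 before the final subtraction, and 4 is returned; multiplying 4 by 2^4 mod 13 = 3 gives 12, which is (12·11 + 10) mod 13. 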

In summary, the result of a Montgomery Multiplication is the desired result 
multiplied by 2^(-n). 
To retrieve the previous result back into a desired result using the same 
multiplication method, we would have to Montgomery Multiply the previous result by 
2^(2n), which we will call H, as each MM leaves us with a parasitic factor of 2^(-n). 

The Montgomery Multiply function P(A·B)N performs a multiplication 
modulo N of the A·B product into the P field. (In the above example, where we derived 
4.) The retrieval from the P field back into the normal modular field is performed by 
enacting P on the result of P(A·B)N using the precomputed constant H. 
Now, if P ≡ P(A·B)N, it follows that P(P·H)N ≡ A·B mod N; thereby performing a 
normal modular multiplication in two P field multiplications. 

Montgomery modular reduction averts a series of multiplication and 
division operations on operands that are n and 2n characters long, by performing a series 
of multiplications, additions, and subtractions on operands that are n or n+1 characters 
long. The entire process yields a result which is smaller than or equal to N. For given A, 
B and odd N there is always a Q, such that A·B + Q·N will result in a number whose n 
LS characters are zero, or: 

P·r^n = A·B + Q·N 

This means that we have an expression that is 2n characters long, whose 
n LS characters are zero. 

Now, let I·r^n ≡ 1 mod N (I exists for all odd N). Multiplying both sides of 
the previous equation by I yields the following congruences: 
from the left side of the equation: 

P·I·r^n ≡ P mod N; (Remember that I·r^n ≡ 1 mod N) 
and from the right side: 

A·B·I + Q·N·I ≡ A·B·I mod N; (Remember that Q·N·I ≡ 0 mod N) 
therefore: 

P ≡ A·B·I mod N. 

This also means that a parasitic factor I = r^(-n) mod N is introduced each time a 
P field multiplication is performed. 

We define the P operator such that: 

P ≡ A·B·I mod N ≡ P(A·B)N, 
and we call this "multiplication of A times B in the P Field", or Montgomery 
Multiplication. 

The retrieval from the P field can be computed by operating P on P·H, making: 

P(P·H)N ≡ A·B mod N. 
We can derive the value of H by substituting P in the previous congruence. 

We find: 

P(P·H)N ≡ (A·B·I)(H)(I) mod N; 

(Note that A·B·I ← P; H ← H; and any multiplication 
operation introduces a parasitic I.) 

If H is congruent to the multiplicative inverse of I^2 then the congruence is 
valid, therefore: 

H = I^(-2) mod N ≡ r^(2n) mod N 

(H is a function of N and we call it the H parameter) 
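
The relation between the P operator, the parasitic factor I and the H parameter can be sketched in Python as follows. This is a reference model only, assuming r = 2; it evaluates I with a modular inverse (Python 3.8 or later), which is not how the described device operates, and the names `p_mul` and `h_parameter` are hypothetical. 

```python
def p_mul(a: int, b: int, n_mod: int, n_chars: int, r: int = 2) -> int:
    """P field multiplication: A*B*I mod N, where I = r**(-n) mod N."""
    i_factor = pow(r, -n_chars, n_mod)   # parasitic factor I (N must be odd)
    return (a * b * i_factor) % n_mod


def h_parameter(n_mod: int, n_chars: int, r: int = 2) -> int:
    """H = I**(-2) mod N = r**(2n) mod N."""
    return pow(r, 2 * n_chars, n_mod)


N, n = 13, 4
A, B = 12, 11
H = h_parameter(N, n)
P = p_mul(A, B, N, n)            # result in the P field
recovered = p_mul(P, H, N, n)    # retrieval from the P field
print(H, P, recovered)
```

Two P field multiplications therefore give the ordinary modular product: `recovered` equals A·B mod N. 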
In conventional Montgomery methods, to enact the P operator on A·B, 
the following process may be employed, using the precomputed constant J: 

1) X = A·B 

2) Y = (X·J) mod r^n (only the n LS characters are necessary) 

3) Z = X + Y·N 

4) S = Z / r^n (The requirement on J is that it forces Z to be divisible by r^n) 

5) P ≡ S mod N (N is to be subtracted from S, if S > N) 
Finally, at step 5): 

P ≡ P(A·B)N. 
[After the subtraction of N, if necessary: 

P = P(A·B)N.] 
Following the above: 

Y = A·B·J mod r^n (using only the n LS characters); 

and: 

Z = A·B + (A·B·J mod r^n)·N. 

In order that Z be divisible by r^n (the n LS characters of Z are preferably 
zero) the following congruence preferably exists: 

[A·B + (A·B·J mod r^n)·N] mod r^n = 0 

In order that this congruence will exist, N·J mod r^n must be congruent to -1, 

or: 

J ≡ -N^(-1) mod r^n, 
and we have found the constant J. 
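
Steps 1) to 5) with the constant J can be written out as a short Python sketch. It is given under the assumption r = 2 and an odd N; the function name `montgomery_with_j` is hypothetical, and the modular inverse used to obtain J relies on Python 3.8 or later. 

```python
def montgomery_with_j(a: int, b: int, n_mod: int, n_chars: int, r: int = 2) -> int:
    """Enact the P operator on A*B using the precomputed constant J."""
    rn = r ** n_chars
    j = (-pow(n_mod, -1, rn)) % rn     # J = -N^(-1) mod r^n
    x = a * b                          # step 1
    y = (x * j) % rn                   # step 2: only the n LS characters
    z = x + y * n_mod                  # step 3
    assert z % rn == 0                 # J forces the n LS characters to zero
    s = z // rn                        # step 4
    return s - n_mod if s >= n_mod else s   # step 5: at most one subtraction


print(montgomery_with_j(12, 11, 13, 4))
print(hex(montgomery_with_j(0x99B, 0x5C3, 0xA59, 12)))
```

The first call uses the operands of the earlier binary example (without S); the second uses the operands of the hexadecimal example given later in this specification. 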

J, therefore, is a precomputed constant which is a function of N only. 
However, in a machine that outputs an MM result character by character, provision 
should be made to add in Ns at each instance where the output character in the LS string 
would otherwise have been non-zero, thereby obviating the necessity of precomputing J 
and subsequently computing Y = A·B·J mod r^n, as Y can be detected character by 
character using hardwired logic. As discussed in detail above, this method can only 
work for odd Ns. 

Therefore, as is apparent, the process described employs three 
multiplications, one summation, and a maximum of one subtraction, for the given A, B, 
N, and a precomputed constant to obtain P(A·B)N. Using this result, the same process 
and a precomputed constant, H (a function of the modulus N), we are able to find 
A·B mod N. As A can also be equal to B, this basic operator can be used as a device to 
square or multiply in the modular arithmetic. 

Interleaved Montgomery Modular Multiplication 
The previous section describes a method for modular multiplication which 
involved multiplications of operands which were all n characters long, and results which 
required 2n + 1 characters of storage space. 

Using Montgomery's interleaved reduction (as described in the 
aforementioned paper by Dusse), it is possible to perform the multiplication operations 
with shorter operands, registers, and hardware multipliers; enabling the implementation 
of an electronic device with relatively few logic gates. 

First we will describe how the device can work if, at each iteration of the 
interleave, we compute the number of times that N is added, using the J_0 constant. Later, 
we describe how to interleave using a hardwire derivation of Y_0, which will eliminate 
the J_0 phase of each multiplication {step 2) in the following example}, and enable us to 
integrate the functions of two separate serial/parallel multipliers into the new single generic 
multiplier which can perform A·B + C·N + S at better than double speed using similar 
silicon resources. 

Using a k character multiplier, it is convenient to define segments of k 
character length; there are m segments in n characters; i.e., m·k = n. 
J_0 will be the LS segment of J. 
Therefore: 

J_0 ≡ -N_0^(-1) mod r^k (J_0 exists as N is odd). 

Note, the J and J_0 constants are compensating numbers that, when enacted 
on the potential output, tell us how many times to add the modulus in order to have a 
predefined number of least significant zeros. We will later describe an additional 
advantage to the present serial device; since, as the next serial bit of output can be easily 
determined, we can always add the modulus (always odd) to the next intermediate 
result. This is the case if, without this addition, the output character, the LS serial bit 
exiting the CSA, would have been a "1"; thereby adding in the modulus to the previous 
even intermediate result, and thereby promising another LS zero in the output string. 
Remember, congruency is maintained, as no matter how many times the modulus is 
added to the result, the remainder is constant. 

In the conventional use of Montgomery's interleaved reduction, P(A·B)N is 
enacted in m iterations as described in the following steps (1) to (5): 

Initially S(0) = 0 (the value of S at the outset of the first iteration). 
For i = 1, 2, ..., m: 

1) X = S(i-1) + A_(i-1)·B (A_(i-1) is the i-1'th character of A; S(i-1) is the value 
of S at the outset of the i'th iteration.) 

2) Y_0 = X_0·J_0 mod r^k (The LS k characters of the product X_0·J_0) 

(The process uses and computes the k LS characters only, e.g., the 
least significant 64 characters.) 

In the preferred implementation, this step is obviated, because in a serial machine 
Y_0 can be anticipated character by character. 

3) Z = X + Y_0·N 

4) S(i) = Z / r^k (The k LS characters of Z are always 0, therefore Z is always 
divisible by r^k. This division is tantamount to a k character right shift as the LS k 
characters of Z are all zeros; or, as will be seen in the circuit, the LS k characters of Z are 
simply disregarded.) 

5) S(i) = S(i) mod N (N is to be subtracted from those S(i)'s which are 
larger than N). 

Finally, at the last iteration (after the subtraction of N, when necessary), 

C = S(m) = P(A·B)N. 

To derive F = A·B mod N, the P field computation, P(C·H)N, is performed. 

It is desired to know, in a preferred embodiment, that for all S(i)'s, S(i) is smaller 
than 2N. This also means that the last result (S(m)) can always be reduced to a quantity 
less than N with, at most, one subtraction of N. 

We observe that for operands which are used in the process: 

S(i-1) < r^(n+1) (the temporary register can be one bit longer than the B or N register), 
B < N < r^n and A_(i-1) < r^k. 
By definition: 

S(i) = Z / r^k (The value of S at the end of the process, before a possible 
subtraction) 

For all Z, Z(i) < r^(n+k+1). 

X_max = S_max + A_i·B < r^(n+1) - 1 + (r^k - 1)(r^n - 1) 

Q_max = Y_0·N < (r^k - 1)(r^n - 1) 

therefore: 

Z_max < r^(k+n+1) - r^(k+1) + 1 < r^(k+n+1) - 1, 
and as Z_max is divided by r^k: 
S(m) < r^(n+1) - r^1. 

Because N_min > r^n - r, S(m)_max is always less than 2·N_min, and one subtraction is all that is 
necessary on a final result. 

S(m)_max - N_min = (r^(n+1) - r^1 - 1) - (r^n - 1) = r^n - 2 < N_min 
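
The worst-case sums above can be checked by direct arithmetic. The following Python sketch assumes r = 2, as in the examples of this specification, and simply evaluates the stated upper limits for a range of n and k; the variable names are illustrative and not taken from the specification. 

```python
# Arithmetic check of the worst-case limits, assuming r = 2.
r = 2
checked = 0
for n in range(2, 65):              # operand length in characters
    for k in range(1, n + 1):       # segment length in characters
        x_max = r**(n + 1) - 1 + (r**k - 1) * (r**n - 1)   # limit on X
        q_max = (r**k - 1) * (r**n - 1)                    # limit on Y_0*N
        z_max = x_max + q_max                              # limit on Z
        assert z_max == r**(k + n + 1) - r**(k + 1) + 1
        assert z_max < r**(k + n + 1) - 1
        assert z_max // r**k == r**(n + 1) - r             # limit on S(m)
        checked += 1
print(checked, "parameter pairs checked")
```

The identity for Z_max relies on 2·r^n = r^(n+1), so the sketch is only a check for binary characters. 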

Example of a Montgomery interleaved modular multiplication: 
The following computations in the hexadecimal format clarify the meaning 
of the interleaved method: 
N = a59 (the modulus), A = 99b (the multiplier), B = 5c3 (the multiplicand), 
n = 12 (the character length of N), r = 2, k = 4 (the size in characters of the multiplier 
and also the size of a segment), and m = 3, as n = k·m. 

J_0 = 7 as 7·9 ≡ -1 mod 16 and H ≡ 2^(2·12) mod a59 ≡ 44b. 

The expected result is F ≡ A·B mod N = 99b·5c3 mod a59 ≡ 375811 
mod a59 = 220_16. 

Initially: S(0) = 0 

Step 1   X = S(0) + A_0·B = 0 + b·5c3 = 3f61 
         Y_0 = X_0·J_0 mod r^k = 7 (Y_0 - hardwire anticipated in new MAP) 
         Z = X + Y_0·N = 3f61 + 7·a59 = 87d0 
         S(1) = Z / r^k = 87d 

Step 2   X = S(1) + A_1·B = 87d + 9·5c3 = 3c58 
         Y_0 = X_0·J_0 mod r^k = 8·7 mod 2^4 = 8 (Hardwire anticipated) 
         Z = X + Y_0·N = 3c58 + 52c8 = 8f20 
         S(2) = Z / r^k = 8f2 

Step 3   X = S(2) + A_2·B = 8f2 + 9·5c3 = 3ccd 
         Y_0 = d·7 mod 2^4 = b (Hardwire anticipated) 
         Z = X + Y_0·N = 3ccd + b·a59 = aea0 
         S(3) = Z / r^k = aea, 

as S(3) > N, 

S(m) = S(3) - N = aea - a59 = 91 
Therefore C = P(A·B)N = 91_16. 

Retrieval from the P field is performed by computing P(C·H)N: 
Again initially: S(0) = 0 

Step 1   X = S(0) + C_0·H = 0 + 1·44b = 44b 
         Y_0 = d (Hardwire anticipated in new MAP) 
         Z = X + Y_0·N = 44b + 8685 = 8ad0 
         S(1) = Z / r^k = 8ad 

Step 2   X = S(1) + C_1·H = 8ad + 9·44b = 2f50 
         Y_0 = 0 (Hardwire anticipated in new MAP) 
         Z = X + Y_0·N = 2f50 + 0 = 2f50 
         S(2) = Z / r^k = 2f5 

Step 3   X = S(2) + C_2·H = 2f5 + 0·44b = 2f5 
         Y_0 = 3 (Hardwire anticipated in new MAP) 
         Z = X + Y_0·N = 2f5 + 3·a59 = 2200 
         S(3) = Z / r^k = 220_16 
which is the expected value of 99b·5c3 mod a59. 
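
The hexadecimal example can be reproduced with the following Python sketch of steps (1) to (5). It assumes r = 2 and computes Y_0 from J_0 in software, whereas the preferred embodiment anticipates Y_0 in hardware; the function name `interleaved_p_mul` is hypothetical, and the inverse used for J_0 relies on Python 3.8 or later. 

```python
def interleaved_p_mul(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """Interleaved Montgomery multiplication with r = 2 and k-bit segments."""
    rk = 1 << k
    j0 = (-pow(n_mod % rk, -1, rk)) % rk      # J_0 = -N_0^(-1) mod r^k
    s = 0                                     # S(0) = 0
    for i in range(m):
        a_seg = (a >> (k * i)) & (rk - 1)     # segment of A, LS segment first
        x = s + a_seg * b                     # step 1
        y0 = ((x % rk) * j0) % rk             # step 2
        z = x + y0 * n_mod                    # step 3
        s = z >> k                            # step 4: drop the k LS zeros
        if s >= n_mod:                        # step 5
            s -= n_mod
    return s


N, A, B, H = 0xA59, 0x99B, 0x5C3, 0x44B
C = interleaved_p_mul(A, B, N, k=4, m=3)      # P field product
F = interleaved_p_mul(C, H, N, k=4, m=3)      # retrieval using H
print(hex(C), hex(F))
```

Running the sketch yields C = 91 and F = 220 in hexadecimal, matching the two tables above. 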

If at each step we disregard k LS zeros, we are in essence multiplying the n MS 
characters by r^k. Likewise, at each step, the i'th segment of the multiplier is also a 
number multiplied by r^(ik), giving it the same rank as S(i). 

It may also be noted that in another preferred embodiment, it is of potential value 
to know the J_0 constant. 



Exponentiation 

The following derivation of a sequence [D. Knuth, The Art of Computer 
Programming, vol. 2: Seminumerical Algorithms, Addison-Wesley, Reading, Mass., 
1981], hereinafter referred to as "Knuth", explains a sequence of squares and multiplies 
which implements a modular exponentiation. 

After precomputing the Montgomery constant, H = 2^(2n), as this device can 
both square and multiply in the P field, we compute: 

C = A^E mod N. 

Let E(j) denote the j'th bit in the binary representation of the exponent E, starting 
with the MS bit whose index is 1 and concluding with the LS bit whose index is q. We 
can exponentiate as follows for odd exponents: 

A* ≡ P(A·H)N          (A* is now equal to A·2^n.) 
B = A* 

FOR j = 2 TO q-1 
    B ≡ P(B·B)N 
    IF E(j) = 1 THEN 
        B ≡ P(B·A*)N 
ENDFOR 

B ≡ P(B·B)N           (the squaring for j = q) 
B ≡ P(B·A)N           (E(q) = 1; B is the last desired temporary result multiplied by 
                      2^n. A is the original A.) 

C = B 

C = C - N if C > N. 

After the last iteration, the value B is ≡ to A^E mod N, and C is the final value. 

To clarify, we shall use the following example: 

E = 1011 -> E(1) = 1; E(2) = 0; E(3) = 1; E(4) = 1; 

To find A^1011 mod N; q = 4 

A* = P(A·H)N = A·I^(-2)·I = A·I^(-1) mod N 
B = A* 

FOR j = 2 to q 

j = 2    B = P(B·B)N, which produces: A^2·(I^(-1))^2·I = A^2·I^(-1) 
         E(2) = 0; B = A^2·I^(-1) 

j = 3    B = P(B·B)N = A^4·(I^(-1))^2·I = A^4·I^(-1) 
         E(3) = 1; B = P(B·A*)N = (A^4·I^(-1))·(A·I^(-1))·I = A^5·I^(-1) 

j = 4    B = P(B·B)N = A^10·I^(-2)·I = A^10·I^(-1) 
As E(4) was odd, the last multiplication will be by A, to remove the parasitic I^(-1). 
         B = P(B·A)N = A^10·I^(-1)·A·I = A^11 
C = B 
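
The square and multiply sequence can be sketched in Python as follows. This is a reference model under stated assumptions: r = 2, an odd modulus, and an odd exponent; the P operator is modelled with a modular inverse (Python 3.8 or later) rather than with the serial hardware, and the names `p_mul` and `mont_exp` are hypothetical. The exponent 1 is treated as a special case because the MS bit and the LS bit then coincide. 

```python
def p_mul(a: int, b: int, n_mod: int, n_chars: int) -> int:
    """Reference P field multiplication: A*B*2**(-n) mod N (r = 2)."""
    return (a * b * pow(2, -n_chars, n_mod)) % n_mod


def mont_exp(a: int, e: int, n_mod: int, n_chars: int) -> int:
    """Compute a**e mod n_mod for odd e by P field squares and multiplies."""
    if e < 1 or e % 2 == 0:
        raise ValueError("this sketch handles odd positive exponents only")
    if e == 1:
        return a % n_mod
    h = pow(2, 2 * n_chars, n_mod)            # the H parameter
    a_star = p_mul(a, h, n_mod, n_chars)      # A* = A * 2**n mod N
    b = a_star                                # accounts for the MS bit E(1)
    bits = bin(e)[2:]                         # bits[0] is E(1), the MS bit
    for bit in bits[1:-1]:                    # j = 2 .. q-1
        b = p_mul(b, b, n_mod, n_chars)
        if bit == "1":
            b = p_mul(b, a_star, n_mod, n_chars)
    b = p_mul(b, b, n_mod, n_chars)           # squaring for the LS bit, j = q
    return p_mul(b, a, n_mod, n_chars)        # multiply by the original A


print(mont_exp(7, 0b1011, 13, 4), pow(7, 0b1011, 13))
```

Multiplying by the original A in the last step, rather than by A*, cancels the remaining parasitic factor, so no separate retrieval from the P field is needed. 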

A method for computing the H parameter by a reciprocal process is described in 
US Patent 5,513,133. 

Reference is now made to Fig. 3 which is a simplified block diagram showing 
how the present invention may be implemented in a smart card. An internal bus 500 
links components including a CPU 502, a RAM 504, a ROM 506, a controlled access 
ROM 508, and modular arithmetic co-processor 510. As shown herein, the 
co-processor 510 is connected via data 512 and control 514 registers to the internal bus 
500. The controlled access ROM 508 is connected via address and data latch means 516 
and a control and test register 518. Various other devices may be attached to the bus 
such as a physical sequence random generator 520, security logic 522 and interfacing 
and resetting circuitry 524 and 526 respectively. 

When a cryptographic program such as verifying an RSA signature is run, it may 
require modular arithmetic functions such as modular exponentiation. The 
cryptographic program that calls the cryptographic function is preferably run on the 
CPU 502. However the modular arithmetic function is carried out on the co-processor 
510, the CPU 502 serving only to assist with storage and like menial tasks. 

Reference is now made to Fig. 4 which is another simplified block diagram of an 
implementation of the present invention for use in a smart card. Parts that are the same 
as those shown in Fig. 3 are given the same reference numerals and are not described 
again, except as necessary for an understanding of the present embodiment. In Fig. 4 
the CPU 502 is shown with an external accumulator 530. Detaching the CPU 
Accumulator from the Data Bus 500, while unloading data from the arithmetic 
co-processor, enables direct transfer of data from the SMAP to memory. 

Fig. 5 is a simplified block diagram of a preferred embodiment of a data register 
bank within a co-processor such as co-processor 510, with a J_0 generator. The 
co-processor 510 is connected to a data bus with a CPU as in previous figures. Parts 
that are the same as those shown in previous figures are given the same reference 
numerals and are not described again, except as necessary for an understanding of the 
present embodiment. A register bank 540 comprises a B register 542, an A register 544, 
an S register 546, and an N register 548. The outputs of each of the registers are 
connected to a serial data switch and serial process conditioner 550 which in turn is 
connected to an operational unit 552 which carries out the modular arithmetic 
operations. Connected between the N register 548 and the operational unit 552 is a J_0 
generator 552. 

In the embodiment the Jo generator compiles a I bit primary zero forcing function 
for use in the modular arithmetic functions described above. 

Fig. 6 is a simplified internal block diagram of the operational unit of Fig. 5. The 
unit preferably supports accelerated squaring operations, in that the additional Y_0B_0 
serial buffer accepts a Y_0 value in the first multiplication phase, and in the second 
multiplication phase, a modular reduced B_0 is used for 

Reference is now made to Fig. 7a which is a block diagram of the main 
computational part of the operational unit of Fig. 6. Numbers appearing in circles relate 
to the sequence diagrams of Figs. 7B to 7D. 

Reference is now made to Fig. 7b which is an event timer pointer diagram 
showing progressively the process leading to and including the first iteration of a 
squaring operation. 

Reference is now made to Fig. 7c which is a generalized event sequence showing 
a method of eliminating the Next Montgomery Squaring delays in a first iteration of a 
squaring sequence. Circled numbers refer to Figs. 7a, b and d. 

Reference is now made to Fig. 7d which is a generalized event timer pointer 
diagram illustrating the timing of the computational output of the first iteration of a 
squaring operation. 

Reference is now made to Fig. 8 which is a schematic diagram for designing a 4 
bit Y_0 zero forcing function. The variable inputs into the force function, f, are the N_0 
bits, the four S_0 bits, the four multiplier and multiplicand bits, A_0, B_0i, and the carry 
switch, S. f is random logic; the A and B bits are input into a ⊗ multiplier and ⊕ added 
to the S_n partial result. When S = 0, all carries are disconnected. 

It is appreciated that various features of the invention which are, for clarity, described 
in the contexts of separate embodiments may also be provided in combination in a 
single embodiment. Conversely, various features of the invention which are, for brevity, 
described in the context of a single embodiment may also be provided separately or in 
any suitable subcombination. 

It will be appreciated by persons skilled in the art that the present invention is not 
limited to what has been particularly shown and described hereinabove. Rather, the 
scope of the present invention includes both combinations and subcombinations of the 
various features described hereinabove as well as variations and modifications thereof 
which would occur to persons skilled in the art upon reading the foregoing description 
and which are not in the prior art. 

In the following claims, symbols such as ⊕ and ⊗ have the meanings given in the 
preceding description. 

CLAIMS 

1. Microelectronic apparatus for performing ⊗ multiplication and squaring in 
both polynomial based GF(2^q) and GF(p) field arithmetic, squaring and reduction using 
a serial fed radix 2^l multiplier, with k character multiplicand segments, A_i, and a k 
character ⊕ accumulator wherein reduction to a limited congruence is performable "on 
the fly", in a systolic manner, with a multiplicand, times B, a multiplier, over a 
modulus, N, and a result being at most 2k + 1 characters long, including k first emitting 
disregarded zero characters, which are not saved, where k characters have no less bits 
than the modulus, wherein said operations are carried out in two phases, the apparatus 
comprising: 

a first (B), and second (N) main memory register, each register operative to hold at 
least n bit long operands, respectively operative to store a multiplier value designated B, and 
a modulus, denoted N, wherein the modulus is smaller than 2^n; 

a digital logic sensing detector, Y_0, operative to anticipate "on the fly" when a modulus 
value is to be ⊕ added to the value in the ⊕ adder accumulator device such that all first k 
characters emitted from the device are forced to zero; 

a modular multiplying device for at least k character input multiplicands, with only 
one, at least k characters long ⊕ adder, a ⊕ summation device operative to accept k 
character multiplicands, the ⊗ multiplication device operative to switch into the 
⊕ accumulator device multiplicand values in turn, and in turn to receive multiplier 
values from a B register, and an "on the fly" simultaneously generated anticipated value 
as a multiplier which is operative to force k first emitting zero output characters in the 
first phase, wherein at each effective machine cycle at least one designated multiplicand 
is ⊕ added into the ⊕ accumulation device; 

the multiplicand values to be switched in turn into the ⊕ accumulation device 
consisting of one or two of the following three multiplicands, the first multiplicand 
being an all-zero string value, a second value, being the multiplicand A_i, and a third 
value, the N_0 segment of the modulus; 

an anticipator to anticipate 1 bit k character serial input Y_0 multiplier values; 
the apparatus being operable to input in turn multiplier values into the multiplying 
device in the first phase, said values being first the B operand, and concurrently, the 
second multiplier value consisting of the Y_0, "on the fly" anticipated k character string, 
to force first emitted zeroes in the output; 

the apparatus further comprising an ⊕ accumulation device, operative to output 
values simultaneously as multiplicands are ⊕ added into the ⊕ accumulation device; 
and 

an output transfer mechanism, operative in the second phase to output a final 
modular ⊗ multiplication result from the ⊕ accumulation device. 

2. Apparatus as in claim 1 wherein ⊕ summations into the ⊕ accumulation 
device are activated by each one of a series of successively newly serially loaded higher 
order multiplier character values. 

3. Apparatus as in claim 1 or claim 2, wherein the multiplier characters are 
operative to cause no ⊕ summation into the ⊕ accumulation device if both the input B 
character and the corresponding input Y_0 character are zeroes; 

are operative to ⊕ add in only the A_i multiplicand if the input B character is a one 
and the corresponding Y_0 character is a zero; 

are operative to ⊕ add in only the N_0 modulus, if the B character is a zero, and the 
corresponding Y_0 character is a one; and 

are operative to ⊕ add in the ⊕ summation of the modulus, with the 
multiplicand A_i if both the B input character and the corresponding Y_0 character are 
ones. 

4. Apparatus as in claim 1, 2 or 3, operative to preload multiplicand values A_i 
and N into two designated preload buffers, and to ⊕ summate these values into a third 
multiplicand preload buffer, obviating the necessity of ⊕ adding in each multiplicand 
value separately. 

5. Apparatus according to any preceding claim, wherein the multiplier 
character values are arranged for input in serial single character form, the 
⊕ accumulation device is arranged for output in serial single character form, and 
wherein the Y_0 detect device is operative to anticipate only one character in a clocked 
turn. 

6. Apparatus according to any preceding claim, wherein the ⊕ accumulation 
device is operable to perform modulo 2, XOR addition/subtraction, and wherein all 
carry bits in addition and subtraction components are disregardable, thus not needing 
provisions for overflow and further limiting modular reduction in computations. 

7. Apparatus according to any preceding claim wherein carry inputs are 
disabled to zero, and being operative to perform polynomial based multiplication. 

8. Apparatus according to any preceding claim, operative to provide non-carry 
arithmetic by setting S equal to zero acting on an element in a circuit equation 
computing in GF(2^q). 

9. Apparatus according to any preceding claim operative to provide non-carry 
arithmetic by omitting carry circuitry such that the S designates omitted circuitry and 
reducing adders and subtractors, designated ⊕, to XOR, modulo 2 addition/subtraction 
elements. 

10. Apparatus as in claim 1 adapted such that the first k character segments 
emitted from the operational units are zeroes, zero forcing being controlled by the 
following four quantities in anticipating the next in turn Y_0 character: 

i the l bit S_out bits of the result of the l bit by l bit mod 2^l ⊗ multiplication of the 
right-hand character of the A_i register times the B_d character of the B stream, 
A_0·B_d mod 2^l; 

ii the first emitting carry out character from the ⊕ accumulation device, S(CO_0); 

iii the l bit S_out character from the second from the right character emitting cell of 
the ⊕ accumulation device, SO_1; 
iv the l bit J_0 value, which is the negative multiplicative inverse of the right-hand 
character in the N_0 modulus multiplicand register, 
wherein values, A_0·B_d mod 2^l, S(CO_0), and SO_1 are ⊕ added character to 
character together and "on the fly" multiplied by the J_0 character to output a valid Y_0 
zero-forcing anticipatory character to force an l bit egressing string of zeroes. 

11. Apparatus according to any preceding claim, operable to perform ⊗ 
multiplication on polynomial based operands in a reverse mode, said multiplication 
comprising multiplying from right hand MS characters to left hand LS characters, said 
apparatus further being operative to perform modular reduced ⊗ multiplication without 
utilizing Montgomery type parasitic functions. 

12. Apparatus according to any preceding claim, further comprising preload 
buffers, which buffers are serially fed and which are connected such that multiplicand 
values are preloadable into the preload buffers on the fly from one or more memory 
devices. 

13. Apparatus according to any preceding claim, operable to ⊕ sum a previously 
emitted value from an additional n bit register, S, into the output value of the 
⊕ accumulation device via a 1 bit ⊕ adder circuit such that first emitting output 
characters are zeroes, 

wherein the Y_0 detector is operative to detect any necessity of ⊕ adding moduli to 
the ⊕ summation in the ⊕ accumulation device, 

wherein the Y_0 detector is further operative to detect utilization of the next in turn 
⊕ added characters A_0·B_d mod 2^l, S(CO_0), SO_1, S_d and S(CO^), and the composite of 
⊕ added characters to be finite field ⊗ multiplied on the fly by the l bit J_0 value. 

14. Apparatus according to claim 10 or claim 13, wherein for l = 1, J_0 is 
implicitly 1, and the J_0 ⊗ multiplication is carried out implicitly. 

15. Apparatus according to any preceding claim, wherein a comparator is 
operative to sense a finite field output from the ⊗ modular multiplication device whilst 
operating in GF(p), wherein the first right hand emitting k zero characters are 
disregarded, wherein the output is larger than the modulus, N, the apparatus thereby 
being operative to control a modular reduction whence said value is output from a 
memory register to which an output stream from the multiplier device is destined, and 
thereby not requiring a second memory storage device for smaller ones of resulting 
product values. 
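
The comparison and reduction of claim 15 may be pictured as follows. This is a minimal sketch, assuming GF(p) operation on Python integers and assuming that the raw output, once the first k zero characters are dropped, is less than twice the modulus so that one subtraction suffices; the function and parameter names are illustrative only.

```python
def final_reduce(raw_output: int, modulus: int, k: int, l: int) -> int:
    """Drop the k leading zero characters (l bits each), then reduce once."""
    char_bits = k * l
    low_mask = (1 << char_bits) - 1
    if raw_output & low_mask:
        raise ValueError("first k emitted characters are expected to be zero")
    value = raw_output >> char_bits      # disregard the k zero characters
    if value >= modulus:                 # comparator senses an oversized output
        value -= modulus                 # reduce while streaming the result out
    return value
```

Because the subtraction is applied to the value as it leaves the result register, no second copy of the smaller value needs to be stored.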

16. Apparatus according to any preceding claim, which, for ⊗ modular 
multiplication in the GF(2^q), is operative to carry out multiplication without an 
externally precomputed more than 1 bit zero-forcing factor. 

17. Apparatus according to any preceding claim, operative to compute a J0 
constant by resetting either the A operand value or the B operand value to zero and 
setting a partial result value, S0, to 1. 
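
The J0 constant of claims 14 and 17 can be checked numerically. The sketch below assumes GF(p) operation, an odd right-hand modulus character N0, and Python 3.8 or later (for the modular inverse form of the built-in pow); the function names are illustrative. It computes J0 directly as the negative multiplicative inverse of N0 modulo 2^l, and then shows that the zero-forcing expression returns J0 itself when A or B is zero and S0 is 1.

```python
def zero_forcing_constant(n0: int, l: int) -> int:
    """Return J0 = -(N0 ** -1) mod 2**l for an odd character N0."""
    if n0 % 2 == 0:
        raise ValueError("N0 must be odd to be invertible modulo 2**l")
    base = 1 << l
    return (-pow(n0, -1, base)) % base


def y0_from_inputs(a0: int, b0: int, s0: int, j0: int, l: int) -> int:
    """Zero-forcing character for one step: Y0 = J0 * (S0 + A0*B0) mod 2**l."""
    mask = (1 << l) - 1
    return (j0 * (s0 + a0 * b0)) & mask
```

With l = 1 the only odd character is 1, so zero_forcing_constant(1, 1) returns 1, which agrees with J0 being implicitly 1 in that case.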

18. Microelectronic apparatus for performing interleaved finite field ⊗ modular 
multiplication of two integers A and B, so as to generate an output stream of A times B 
modulus N, wherein a number of characters in a modulus operand register, n, is larger 
than a segment length of k characters, wherein the ⊗ modular multiplication is 
performed in a plurality of interleaved iterations, wherein at each interleaved iteration, 
operands are input into a ⊗ multiplying device, said operands comprising any one of N, 
B, a previously computed partial result, S, and a k character string segment of A, the 
segments progressing from a first string segment A0 to a higher string segment Am-1, 
wherein each iterative result is ⊕ summated into a next temporary result, wherein at 
least first emitted characters of said iterations are zeroes, the apparatus comprising: 

first (B), second (S) and third (N) main memory registers, each register respectively 
operative to store a multiplier value, a partial result value and a modulus; 

a modular multiplying device operative to ⊕ summate into an ⊕ accumulation 
device, in turn one or two of a plurality of multiplicand values, during each one of a 
plurality of phases of the iterative ⊗ multiplication process, and in turn to receive as 
multipliers, inputs from: 

said B register, 

an "on the fly" anticipating value (Y0) source, said anticipating value being usable 
as a multiplier to force first emitting right-hand zero output characters in each iteration, 
and 

said N register; 

the multiplicand parallel registers operative at least to receive in turn, values from 
the A, B, and N registers, and also said zero forcing (Y0) value; 

the apparatus further comprising a zero forcing (Y0) detect device operative to 
generate a binary string operative to be a multiplier during a first multiplication phase 
and operative to be a multiplicand in a second multiplication phase; 

the apparatus being operable to obtain multiplicand values suitable for switching 
into the ⊕ accumulation device for the first multiplication phase, said values comprising 
firstly a zero value, secondly a value, Ai, being a k character string segment of a 
multiplicand, A, and a third value N0, being the first emitting k characters of the 
modulus, N; 

the apparatus further being operable to utilize a temporary result value, S, resulting 
from a previous iteration, to be ⊕ summated with a present result value emanating from 
the ⊕ accumulation device, to generate a partial result for a next-in-turn iteration; 

the apparatus further being operable to utilize multiplicand values to be input, in 
turn, into the ⊕ accumulation device for a second multiplication phase comprising 
firstly a zero value, secondly an Ai operand, remaining in place from the first phase, and 
thirdly a Y0 value having been anticipated in the first multiplication phase; 

multiplier values input into the multiplying device in the first phase being firstly 
an emitted string, B0, said multiplication device being operable to multiply said string 
concurrently ⊗ with a second ⊗ multiplier value consisting of the anticipated Y0 string 
which is simultaneously loaded character by character as it is generated into a preload 
multiplicand buffer for the second phase; 

two multiplier values operable to be input into the apparatus during a second 
phase being left hand n-k character values from the B operand, designated B̄, and the left 
hand n-k characters of the N modulus, designated N̄, respectively; and 

wherein said apparatus further comprises a multiplying flush out device operative 
in a last multiplication phase to transfer a left hand segment of a result value remaining 
in the ⊕ accumulation device into a result register. 
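
The interleaved iteration of claim 18 corresponds, at the arithmetic level, to a segment-serial Montgomery type multiplication. The sketch below is a software model under stated assumptions, not a description of the claimed hardware: it works in GF(p) on Python integers, treats each k character segment as a single block of seg_bits = k·l bits, requires an odd modulus, and needs Python 3.8 or later for the modular inverse. All names are illustrative.

```python
def interleaved_montgomery(a: int, b: int, n: int, seg_bits: int, m: int) -> int:
    """Return a * b * 2**(-seg_bits * m) mod n.

    Assumes n is odd and a, b < n < 2**(seg_bits * m).
    """
    base = 1 << seg_bits
    mask = base - 1
    j0 = (-pow(n, -1, base)) % base          # negative inverse of N0 mod 2**seg_bits
    s = 0                                    # partial result S
    for i in range(m):
        a_i = (a >> (i * seg_bits)) & mask   # segment A_i, least significant first
        # Y0 is chosen so that the first emitted segment of the sum is zero.
        y0 = (j0 * ((s + a_i * b) & mask)) & mask
        total = s + a_i * b + y0 * n
        assert total & mask == 0             # zero-forced low segment
        s = total >> seg_bits                # discard the zero segment
    if s >= n:                               # partial result stays below 2n
        s -= n
    return s
```

The returned value carries the factor 2^(-seg_bits·m); multiplying it by 2^(seg_bits·m) modulo n recovers A·B mod N, which is how the sketch can be checked against ordinary integer arithmetic.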

19. Apparatus as in claim 18, operable to perform ⊗ multiplication on 
polynomial based operands in a reverse mode, multiplying from MS characters to LS 
characters, and thereby being able to perform modular reduction without Montgomery 
type parasitic functions. 

20. An apparatus operative in modular multiplication to anticipate a Y0 value using 
first emitted values of a multiplicand, and present inputs of a B multiplier, carry out 
values from a ⊕ accumulation device, ⊕ summation values from the ⊕ accumulation 
device, the present values from a previously computed partial result, and carry out 
values from a ⊕ adder which ⊕ summates the result from the ⊕ accumulation device 
with the previous partial result. 

21. An apparatus as in claim 20 adapted to ensure that k first emitted values from 
the device are zeroes, said adaptation comprising anticipation of a next in turn Y0 
character using the following quantities: 

i l bit S_out bits of a result of l bit by l bit mod 2^l ⊗ multiplication of the 
right-hand character of an Ai register times the Bd character of the B stream, 
A0·Bd mod 2^l; 

ii a first emitting carry out character from the ⊕ accumulation device, S(CO0); 

iii the l bit S_out character from a second from the right-hand character emitting 
cell of the ⊕ accumulation device, SO1; 

iv a next in turn character value from the S stream; 

v a 1 bit carry out character from a Z output full adder, S(COz); 

vi an l bit J0 value, which is a negative multiplicative inverse of a right-hand 
character in the N0 modulus multiplicand register; 

wherein values A0·Bd mod 2^l, S(CO0), SO1, Sd are ⊕ added character to 
character together and "on the fly" ⊗ multiplied by the J0 character to output a valid Y0 
zero-forcing anticipatory character. 
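
The anticipation of claim 21 may be sketched at the level of a single character. The example assumes GF(p) operation, so that ⊕ is ordinary addition with carries and ⊗ is ordinary multiplication; the function name and the way the quantities are passed in as a list are illustrative choices, not taken from the specification.

```python
from typing import Iterable


def anticipate_y0(terms: Iterable[int], j0: int, l: int) -> int:
    """Return the Y0 character for one clock of the zero-forcing logic.

    `terms` holds the character-sized quantities that are added together,
    for example [A0*Bd mod 2**l, S(CO0), SO1, Sd]; `j0` is the negative
    multiplicative inverse of the right-hand modulus character mod 2**l.
    """
    mask = (1 << l) - 1
    composite = sum(terms) & mask        # character-wise sum, low l bits only
    return (j0 * composite) & mask       # "on the fly" multiplication by J0
```

Adding Y0·N0 to the same sum then leaves a right-hand character of zero, because J0·N0 is congruent to -1 modulo 2^l. In GF(2^q) operation the additions would be carry-free (XOR), which this sketch does not model.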

22. Apparatus as in any one of claims 18 to 21 comprising at least one sensor 
operative to compare an output result to N, the mechanism operative to actuate a second 
subtractor on the output of the result register, thereby to output a modular reduced value 
which is limited congruent to the output result value, thereby avoiding any necessity to 
allot a second memory storage for a smaller result. 

23. Apparatus according to any one of claims 18 to 22, wherein a value which is 
a ⊕ summation of two multiplicands is loadable into a preload character buffer 
comprising at least a k character memory register whilst one of the said two 
multiplicands is concurrently loaded into another preload buffer. 

24. Apparatus with one ⊕ accumulation device, and an anticipating zero forcing 
mechanism, operative to perform a series of interleaved ⊗ modular multiplications and 
squarings, and being adapted to perform concurrently the equivalent of three natural 
integer multiplication operations, such that a result is an exponentiation. 
25. Apparatus according to any one of claims 18 to 24, wherein next in turn 
used multiplicands are preloaded into a preload register buffer on the fly. 

26. An apparatus according to any one of claims 18 to 25, operable to ⊕ sum 
two multiplicands into at least a k character register whilst concurrently loading one of 
the two multiplicands into a preload buffer. 

27. Apparatus according to any one of claims 18 to 26, wherein apparatus 
buffers and registers are operative to be loaded with values from external memory 
sources and to be unloaded into an external memory source during computations, such 
that a maximum size of the operands is independent of sizes of said registers and said 
buffers. 

28. Apparatus according to any one of claims 1 to 27, comprising a memory 
register, said memory register being typically serial-single-character-in/serial-single-
character-out, parallel-at-least-k-characters-in/parallel-at-least-k-characters-out, 
serial-single-character-in/parallel-at-least-k-characters-out, and parallel-k-
characters-in/serial-single-character-out. 

29. An apparatus according to any preceding claim operable to provide, during a 
final phase of a ⊗ multiplication type iteration, at inputs of said multiplication device, a 
plurality of zero characters, which zero characters are operative to flush out a left hand 
segment of a memory of the carry save ⊕ accumulator. 

30. An apparatus as in any one of claims 18 to 29, operable to preload next in turn 
multiplicands into preload memory buffers on the fly, prior to their being required in an 
iteration. 

31. An apparatus according to any one of claims 18 to 29, operable to preload 
multiplicand values into preload buffers on the fly from a central storage memory. 

32. Microelectronic method for performing interleaved finite field ⊗ modular 
multiplication of two integers A and B, so as to generate an output stream of A times B 
modulus N, wherein a number of characters in a modulus operand register, n, is larger 
than a segment length of k characters, wherein the ⊗ modular multiplication is 
performed in a plurality of interleaved iterations, wherein at each interleaved iteration, 
operands are input into a ⊗ multiplying device, said operands comprising any one of N, 
B, a previously computed partial result, S, and a k character string segment of A, a 
multiplicand, the segments progressing from a first string segment A0 to a higher string 
segment Am-1, wherein each iterative result is ⊕ summated into a next temporary result, 
wherein at least first emitted characters of said iterations are zeroes, the method 
comprising the steps of: 

⊕ summating into an ⊕ accumulation device, in turn one or two of a plurality of 
multiplicand values, during each one of a plurality of phases of the iterative ⊗ 
multiplication process, and in turn receiving as multipliers, inputs from: 

a B register, 

an "on the fly" anticipating value, Y0, operable as a multiplier to force first 
emitting right-hand zero output characters in each iteration, and 

an N register; 

generating a binary string operative to be a multiplier during a first multiplication 
phase and operative to be a multiplicand in a second multiplication phase; 

obtaining multiplicand values suitable for switching into the ⊕ accumulation 
device for the first multiplication phase consisting firstly of a zero value, secondly a 
value, Ai, which is a k character string segment of a multiplicand, A, and thirdly a value 
N0, being the first emitted k characters of the modulus, N; 

obtaining a temporary result value, S, resulting from a previous iteration, and 
⊕ summating said temporary result value, S, with a present result value emanating from 
the ⊕ accumulation device, to generate a partial result for a next in turn iteration; 

obtaining multiplicand values for a second multiplication phase comprising firstly 
a zero value, secondly an Ai operand, remaining in place from the first phase, and 
thirdly a Y0 value having been anticipated in the first phase; 

utilizing multiplier values obtained in the first phase, said values being firstly an 
emitted string, B0, and multiplying said string concurrently ⊗ with a second ⊗ 
multiplier value consisting of the anticipated Y0 string as it is simultaneously loaded 
character by character whilst being generated into a preload multiplicand buffer for the 
second phase; 

obtaining two multiplier values during the second multiplication phase, said 
values being left hand n-k character values from the B operand, designated B̄, and the 
left hand n-k characters of the N modulus, designated N̄, respectively; and 

in a last multiplication phase transferring a left hand segment of a result value 
remaining in the ⊕ accumulation device into a result register. 



33. A method according to claim 18 comprising computing J0 = Y0 for l = 1 by 
resetting both A and B to zero and setting S0 = 1. 

For the Applicant 